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Abstract. 

Recent experimental advances in neuroscience have opened new vistas into the 
immense complexity of neuronal networks. This proliferation of data challenges us 
on two parallel fronts. First, how can we form adequate theoretical frameworks for 
understanding how dynamical network processes cooperate across widely disparate 
spatiotemporal scales to solve important computational problems? And second, how 
can we extract meaningful models of neuronal systems from high dimensional datasets? 
To aid in these challenges, we give a pedagogical review of a collection of ideas and 
theoretical methods arising at the intersection of statistical physics, computer science 
and neurobiology. We introduce the interrelated replica and cavity methods, which 
originated in statistical physics as powerful ways to quantitatively analyze large highly 
heterogeneous systems of many interacting degrees of freedom. We also introduce 
the closely related notion of message passing in graphical models, which originated 
in computer science as a distributed algorithm capable of solving large inference and 
optimization problems involving many coupled variables. We then show how both 
the statistical physics and computer science perspectives can be applied in a wide 
diversity of contexts to problems arising in theoretical neuroscience and data analysis. 
Along the way we discuss spin glasses, learning theory, illusions of structure in noise, 
random matrices, dimensionality reduction, and compressed sensing, all within the 
unified formalism of the replica method. Moreover, we review recent conceptual 
connections between message passing in graphical models, and neural computation and 
learning. Overall, these ideas illustrate how statistical physics and computer science 
might provide a lens through which we can uncover emergent computational functions 
buried deep within the dynamical complexities of neuronal networks. 
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1. Introduction 

Neuronal networks are highly complex dynamical systems consisting of large numbers 
of neurons interacting through synapses [HOE]. Such networks subserve dynamics 
over multiple time-scales. For example, on fast time scales, on the order of milliseconds, 
synaptic connectivity is approximately constant, and this connectivity directs the flow 
of electrical activity through neurons. On slower timescales, on the order of seconds 
to minutes and beyond, the synaptic connectivity itself can change through synaptic 
plasticity induced by the statistical structure of experience, which itself can stay constant 
over even longer timescales. These synaptic changes are thought to underly our ability 
to learn from experience. To the extent that such separations of timescale hold, one 
can exploit powerful tools from the statistical physics of disordered systems to obtain a 
remarkably precise understanding of neuronal dynamics and synaptic learning in basic 
models. For example, the replica method and the cavity method, which we introduce 
and review below, become relevant because they allow us to understand the statistical 
properties of many interacting degrees of freedom that are coupled to each other through 
some fixed, or quenched, interactions that may be highly heterogenous, or disordered. 

However, such networks of neurons and synapses, as well as the dynamical processes 
that occur on them, are not simply tangled webs of complexity that exist for their own 
sake. Instead they have been sculpted over time, through the processes of evolution, 
learning and adaptation, to solve important computational problems necessary for 
survival. Thus biological neuronal networks serve a function that is useful for an 
organism in terms of improving its evolutionary fitness. The very concept of function 
does not of-course arise in statistical physics, as large disordered statistical mechanical 
systems, like glasses or non-biological polymers do not arise through evolutionary 
processes. In general, the function that a biological network performs (which may 
not always be a-priori obvious) can provide a powerful way to understand both its 
structure and the details of its complex dynamics [I]. As the functions performed 
by neuronal networks are often computational in nature, it can be useful to turn to 
ideas from distributed computing algorithms in computer science for sources of insight 
into how networks of neurons may learn and compute in a distributed manner. In 
this review we also focus on distributed message passing algorithms whose goal is to 
compute the marginal probability distribution of a single degree of freedom in a large 
interacting system. Many problems in computer science, including error correcting 
codes and constraint satisfaction, can be formulated as message passing problems [5]. 
As we shall review below, message passing is intimately related to the replica and cavity 
methods of statistical physics, and can serve as a framework for thinking about how 
specific dynamical processes of neuronal plasticity and network dynamics may solve 
computational problems like learning and inference. 

This combination of ideas from statistical physics and computer science are not only 
useful in thinking about how network dynamics and plasticity may mediate computation, 
but also for thinking about ways to analyze large scale datasets arising from high 
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throughput experiments in neuroscience. Consider a data set consisting of P points 
in an N dimensional feature space. Much of the edifice of classical statistics and 
machine learning has been tailored to the situation in which P is large and N is small. 
This is the low dimensional data scenario in which we have large amounts of data. In 
such situations, many classical unsupervised machine learning algorithms can easily find 
structures or patterns in data, when they exist. However, the advent of high throughput 
techniques in neuroscience has pushed us into a high dimensional data scenario in which 
both P and N are large, but their ratio is 0(1). For example, we can simultaneously 
measure the activity of 0(100) neurons but often only under a limited number of trials 
(i.e. also 0(100)) for any given experimental condition. Also, we can measure the single 
cell gene expression levels of 0(100) genes but only in a limited number of cells. In such 
a high dimensional scenario, it can be difficult to find statistically significant patterns 
in the data, as often classical unsupervised machine learning algorithms yield illusory 
structures. The statistical physics of disordered systems again provides a powerful 
tool to understand high dimensional data, because many machine learning algorithms 
can be formulated as the minimization of a data dependent energy function on a set of 
parameters. We review below how statistical physics plays a useful role in understanding 
possible illusions of structure in high dimensional data, as well as approaches like random 
projections and compressed sensing, which are tailored to the high dimensional data 
limit. 

We give an outline and summary of this review as follows. In section [2] we introduce 
the fundamental techniques of the replica method and cavity method within the context 
of a paradigmatic example, the Sherrington-Kirkpatrick (SK) model |6] of a spin glass 
IU [9]. In a neuronal network interpretation, such a system qualitatively models a 
large network in which the heterogenous synaptic connectivity is fixed and plays the 
role of quenched disorder. On the otherhand, neuronal activity can fluctuate and we are 
interested in understanding the statistical properties of the neuronal activity. We will 
find that certain statistical properties, termed self- averaging properties, do not depend 
on the detailed realization of the disordered connectivity matrix. This is a recurring 
theme in these notes; in large random systems with microscopic heterogeneity, striking 
levels of almost deterministic macroscopic order can arise in ways that do not depend 
on the details of the heterogeneity. Such order can govern dynamics and learning 
in neuronal networks, as well as the performance of machine learning algorithms in 
analyzing data, and moreover, this order can be understood theoretically through the 
replica and cavity methods. 

We end section [2] by introducing message passing which provides an algorithmic 
perspective on the replica and cavity methods. Many models in equilibrium statistical 
physics are essentially equivalent to joint probability distributions over many variables, 
which are equivalently known and described as graphical models in computer science 
|10j . Moreover, many computations in statistical physics involve computing marginal 
probabilities of a single variable in such graphical models. Message passing, also 
known in special cases as belief propagation [TT], involves a class of algorithms that 
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yield dynamical systems whose fixed points are designed to approximate marginal 
probabilities in graphical models. Another recurring theme in these notes is that 
certain aspects of neuronal dynamics may profitably be viewed through the lens of 
message passing; in essence, these neuronal (and also synaptic) dynamics can be viewed 
as approximate versions of message passing in a suitably defined graphical model. 
This correspondence between neuronal dynamics and message passing allows for the 
possibility of both understanding the computational significance of existing neuronal 
dynamics, as well as deriving hypotheses for new forms of neuronal dynamics from a 
computational perspective. 

In section [3] we apply the ideas of replicas, cavities and messages introduced in 
section [2] to the problem of learning in neuronal networks as well as machine learning 
(see [12] for a beautiful book length review of this topic). In this context, training 
examples, or data play the role of quenched disorder, and the synaptic weights of a 
network, or the learning parameters of a machine learning algorithm, play the role of 
fluctuating statistical mechanical degrees of freedom. In the zero temperature limit, 
these degrees of freedom are optimized, or learned, by minimizing an energy function. 
The learning error, as well as aspects of the learned structure, can be described by 
macroscopic order parameters that do not depend on the detailed realization of the 
training examples, or data. We show how to compute these order parameters for 
the classical perceptron [T3l E], thereby computing its storage capacity. Also we 
compute these order parameters for classical learning algorithms, including Hebbian 
learning, principal components analysis (PCA), and K- means clustering, revealing that 
all of these algorithms are prone to discovering illusory structures that reliably arise in 
random realizations of high dimensional noise. Finally, we end section [3] by discussing 
an application of message passing to learning with binary valued synapses, known to 
be an NP-complete problem [151 H5]- The authors of [TTJ [18] derived a biologically 
plausible learning algorithm capable of solving random instantiations of this problem by 
approximating message passing in a joint probability distribution over synaptic weights 
determined by the training examples. 

In section HI we discuss the eigenvalue spectrum of random matrices. Matrices from 
many random matrix ensembles have eigenvalue spectra whose probability distributions 
display fascinating macroscopic structures that do not depend on the detailed realization 
of the matrix elements. These spectral distributions play a central role in a wide 
variety of fields [HI [20]; within the context of neural networks for example, they 
play a role in understanding the stability of linear neural networks, the transition 
to chaos in nonlinear networks [21], and the analysis of high dimensional data. We 
begin section H] by showing how replica theory can also provide a general framework for 
computing the typical eigenvalue distribution of a variety of random matrix ensembles. 
Then we focus on understanding an ensemble of random empirical covariance matrices 
(the Wishart ensemble [22]) whose eigenvalue distribution, known as the Marcenko- 
Pastur distribution [23], provides a null model for the outcome of PCA applied to high 
dimensional data. Moreover, we review how the eigenvalues of many random matrix 
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ensembles can be thought of as Coulomb charges living in the complex plane, and the 
distribution of these eigenvalues can be thought of as the thermally equilibriated charge 
density of this Coulomb gas, which is stabilized via the competing effects of a repulsive 
two dimensional Coulomb interaction and an attractive confining external potential. 
Moreover we review how the statistics of the largest eigenvalue, which obeys the Tracy- 
Widom distribution [23J [25] , can be understood simply in terms of thermal fluctuations 
of this Coulomb gas [261 The statistics of this largest eigenvalue will make an 

appearance later in section [5] when we discuss how random projections distort the 
geometry of manifolds. Overall, section H] illustrates the power of the replica formalism, 
and plays a role in connecting the statistical physics of two dimensional Coulomb gases 
to PCA in section 1331 and geometric distortions induced by dimensionality reduction in 
section 15.31 

In section [5] we discuss the notion of random dimensionality reduction. High 
dimensional data can be difficult to both model and process. One approach to 
circumvent such difficulties is to reduce the dimensionality of the data; indeed many 
machine learning algorithms search for optimal directions on which to project the 
data. As discussed in section 13.51 such algorithms yield projected data distributions 
that reveal low dimensional, illusory structures that do not exist in the data. An 
alternate approach is to simply project the data onto a random subspace. As the 
dimensionality of this subspace is lower than the ambient dimensionality of the feature 
space in which the data resides, features of the data will necessarily be lost. However, 
it is often the case that interesting data sets lie along low dimensional submanifolds in 
their ambient feature space. In such situations, a random projection above a critical 
dimension, that is more closely related to the dimensionality of the submanifold than to 
the dimensionality of the ambient feature space, often preserves a surprising amount of 
structure of the submanifold. In section [5] we review the theory of random projections 
and their ability to preserve the geometry of data submanifolds. We end section [5] 
by introducing a statistical mechanics approach to random dimensionality reduction 
of simple random submanifolds, like point clouds and hyperplanes. This analysis 
connects random dimensionality reduction to extremal fluctuations of 2D Coulomb gases 
discussed in sections 14.31 and 14.41 

The manifold of sparse signals forms a ubiquitous and interesting low dimensional 
structure that accurately captures many types of data. The field of compressed sensing 
[281 129] , discussed in section El rests upon the central observation that a sparse high 
dimensional signal can be recovered from a random projection down to a surprisingly low 
dimension by solving a computationally tractable convex optimization problem, known 
as Li minimization. In section [6] we focus mainly on the analysis of Li minimization 
based on statistical mechanics and message passing. For readers who are more interested 
in applications of random projections, compressed sensing and L\ minimization to 
neuronal information processing and data analysis, we refer them to [30J. There, diverse 
applications of how the techniques in sections [5] and [6] can be used to acquire and 
analyze high dimensional neuronal data are discussed, including, magnetic resonance 
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imaging [3TJ E21 E3] , compressed gene expression arrays |34J , compressed connectomics 
[35| 136] , receptive field measurements, and fluorescence microscopy [371 EH] of multiple 
molecular species at high spatiotemporal resolution [39] using single pixel camera [401HT] 
technology. Also diverse applications of these same techniques to neuronal information 
processing are discussed in [31)], including semantic information processing [121 H3l PH] . 
short-term memory [i5j 1^6] , neural circuits for L\ minimization [UJ, learning sparse 
representations [HI H9] , regularized learning of high dimensional synaptic weights from 
limited examples [50] . and axonally efficient long range brain communication through 
random projections [5TJ |52j [53J IM] . 

After introducing CS in 16.11 we show how replica theory can be used to analyze its 
performance in section [6721 Remarkably, the performance of CS, unlike other algorithms 
discussed in section 1331 displays a phase transition. For any given level of signal sparsity, 
there is a critical lower bound on the dimensionality of a random projection which is 
required to accurately recover the signal; this critical dimension decreases with increasing 
sparsity. Also, in section 16.31 we review how the L\ minimization problem can be 
formulated as a message passing problem [55]. This formulation yields a message passing 
dynamical system that qualitatively mimics neural network dynamics with a crucial 
history dependence terms. L\ minimization via gradient descent has been proposed 
as a framework for neuronal dynamics underlying sparse coding in both vision [56] 
and olfaction [57]. On the otherhand, the efficiency of message passing in solving L\ 
minimization, demonstrated in [55] may motivate revisiting the issue of sparse coding 
in neuroscience, and the role of history dependence in sparse coding network dynamics. 

Finally, the appendix in section [8] provides an overview of the replica method, 
in a general form that is immediately applicable to spin glasses, perceptron learning, 
unsupervised learning, random matrices and compressed sensing. Overall, the replica 
method is a powerful, if non-rigorous, method for analyzing the statistical mechanics of 
systems with quenched disorder. We hope that this exposition of the replica method, 
combined with the cavity and message passing methods discussed in this review within 
a wide variety of disparate contexts, will help enable students and researchers in both 
theoretical neuroscience and physics to learn about exciting interdisciplinary advances 
made in the last few decades at the intersection of statistical physics, computer science, 
and neurobiology. 

2. Spin Glass Models of Neural Networks 

The SK Model [6] is a prototypical example of a disordered statistical mechanical system. 
It has been employed as a simple model of spin glasses [7J IB] , as well as neural networks 
[58J, and has made a recent resurgence in neuroscience within the context of maximum 
entropy modeling of spike trains [591 ED]- It is defined by the energy function 
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where the are N spin degrees of freedom taking the values ±1. In a neural network 
interpretation, Sj represents the activity state of a neuron and J is the synaptic 
connectivity matrix of the network. This Hamiltonian yields an equilibrium Gibbs 
distribution of neural activity given by 

Pj(b) = ^f mS ' 3) (2) 

where 

Z[J] = J2e^ H{s ' J) (3) 
s 

is the partition function, and (3 is an inverse temperature reflecting sources of noise. 
The connectivity matrix is chosen to be random, where each Jy is an independent, 
identically distributed (i.i.d) zero mean Gaussian with variance 1/N. 

The main property of interest is the statistical structure of high probability (low 
energy) activity patterns. Much progress in spin glass theory [7] has revealed a 
physical picture in which the Gibbs distribution in (j2J) decomposes at low temperature 
(large 0) into many "lumps" of probability mass (more rigorously, pure states [61J) 
concentrated on subsets of activity patterns. Equivalently, these lumps can be thought 
of as concentrated on the minima of a free energy landscape with many valleys. Each 
lump, indexed by a, is characterized by a mean activity pattern m° = (si) a , where (•) 
is an average over configurations belonging to the free energy valley a, and a probability 
mass P a (the probability that a random activity pattern belongs to valley a). In the large 
N limit, free energy barriers between valleys diverge, so that in dynamical versions of 
this model, if an activity pattern starts in one valley, it will stay in that valley for infinite 
time. Thus ergodicity is broken, as time average activity patterns are not equal to the 
full Gibbs average activity pattern. The network can thus maintain multiple steady 
states, and we are interested in understanding the structure of these steady states. 

Now the detailed activity pattern in any free energy minimum a (i.e the mean 
pattern m") depends on the detailed realization of the connectivity J, and is hard to 
compute. However, many interesting quantities, that involve averages over all neurons, 
are self-averaging, which by definition means that their fluctuations across different 
realizations of J vanish in the large N limit. As we see below, typical values of such 
quantities, for any given realization of J, can be computed theoretically by computing 
their average over all J. One interesting quantity that probes the geometry of free 
energy minima is the distribution of overlaps between all pairs of activity patterns. If 
the activity patterns belong to two valleys, a and b, then the overlap is 

q ab = jfJ2 m t m l ( 4 ) 

i 

Now since P a is the probability a randomly chosen activity pattern belongs to valley a, 
the distribution of overlaps between any two pairs of activity patterns independently 
chosen from fl2]) is given by 

Pj(q) = Y,PaP b 5(q-q ab ). (5) 
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This distribution turns out not to be self- averaging (it fluctuates across realizations of 
J), unless there is only one valley, or state (modulo the reflection symmetry s» — >■ — s$), in 
which case the distribution becomes concentrated at a single number q, which is the self- 
overlap of the state, q = -h J2i.m 2 . If there is indeed one state, then q does not depend 
on the detailed realization of J and provides a measure of the variability of mean activity 
across neurons due to the quenched disorder in the connectivity. In the case of multiple 
valleys, one can also compute the disorder averaged overlap distribution 
despite the fact that the overlap distribution Pj(q) may not be self-averaging, its average 
over J can still yield a wealth of information about the geometric organization of free 
energy minima in neural activity space. This can be done using the replica method, 
which we now introduce. 



2.1. Replica Solution 

To understand the statistical properties of the Gibbs distribution in (j2J), it is useful 
to compute its free energy —0F[J] = \nZ[J]. Correlations between neurons can then 
be computed via suitable derivatives of the free energy. Fortunately, the free energy is 
self-averaging, which means that to understand the free energy for any realization of J, 
it suffices to compute its average over all J: 

«-^F[J]» J = «lnZ[J]» J , (6) 

where denotes an average over the disorder J. This average is difficult to do 

because the logarithm appears inside the average. The replica trick circumvents this 
difficulty by exploiting the identity 

Z n — 1 8 

In Z = lim = lim — Z n . (7) 

n->o n "->o an 

This identity is useful because it allows us to first average over an integer power of Z[J], 
which can be done more easily, and then take the n — > limit. Appendix [8] provides a 
general outline of the replica approach that can be used for many problems. Basically 
to compute the average over Z n it is useful to introduce n replicated neuronal activity 
patterns s a , for a = 1, . . . , n, yielding 

((2 n )) J = ((E { s^^-^ V -)} j . (8) 

Now the average over J can be performed because it is reduced to a set of Gaussian 
integrals. To do so, we use the fundamental identity 

(e**) g = e^ 2x \ (9) 

where z is a zero mean Gaussian random variable with variance a 2 . Applying this to 
(EJ) with z = Jij, a 1 = jf, and x = PY^a^^j yields 

= £ e ^E«(E^x •?•?)' = £ e ^E ab %, (10) 

{s°} {sn 
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where 

1 N 

Qa b = JrY, S i S i (11) 

is the overlap matrix between replicated activity patterns. 

Thus although for any fixed realization of the quenched disorder J, the replicated 
activity patterns s a were independent, marginalizing over, or integrating out the disorder 
introduces attractive interactions between the replicas. Consistent with the general 
framework, presented in section [HJ the interaction between replicas depends only on 
the overlap matrix Q, and we have in ( 1120p . E(Q) = — ^-^abQlb- Thus minimization 
of this energy function promotes alignment of the replicas. The intuition is that for 




Figure 1. Probability lumps in free energy valleys. Schematic figures of the space 
of all possible neuronal or spin configurations (large circle) and the space of spin 
configurations with non-negligible probability under the Gibbs distribution in ^) 
(shaded areas). (A) At high temperature all spin configurations are explored by the 
Gibbs distribution. Thus the inner product between two random spins drawn from 
the Gibbs distribution will typically have inner product, and so the replica order 
parameter q is 0. (B) The replica symmetric ansatz for a low temperature phase: the 
spins freeze into a small set of configurations (free energy valley) , which can differ from 
realization to realization of the connectivity J. However, the inner product between 
two random spins, and therefore also the replica order parameter, takes a nonzero value 
q that does not depend on the realization of J. (C) One possible ansatz for replica 
symmetry breaking (RSB) in which the replica overlap matrix Q is characterized by 
two order parameters, q\ > <?2- This ansatz, known as 1-step RSB, corresponds to a 
scenario in which the Gibbs distribution breaks into multiple lumps, with q\ describing 
the typical inner product between two configurations chosen from the same lump, and 
52 describing the typical inner product between configurations from different lumps. 
(D) There exists a series of k-step RSB schemes describing scenarios in which the 
Gibbs distribution decomposes into a nested hierarchy of lumps of depth k. This 
figure describes a possible scenario for k = 2. The true low temperature phase of the 
SK model is thought to be described by a particular k = oo RSB ansatz [7]. 

any fixed realization of J, the replicas will prefer certain patterns. Which patterns are 
preferred will vary across realizations of J. However, for any fixed realization of J, the 
preferred set of patterns will be similar across replicas since the fluctuations of each 
replicated neuronal activity pattern are controlled by the same quenched connectivity 
J. Thus even after averaging over J, we expect this similarity to survive, and hence we 
expect average overlaps between replicas to be nonzero. 

However, minimization of the energy E(Q) alone does not determine the overlap 
matrix Q a b. One must still sum over s a in (TlOT) which yields an entropic term 
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corresponding to the number of replicated activity patterns with a given set of overlaps. 
While energy minimization drives overlaps to be large, entropy maximization drives 
overlaps to be small, since there are many more replicated configurations with small, 
rather than large overlaps. This competition between energy and entropy leads to 
a potentially nontrivial overlap matrix. After computing this entropic term, the most 
likely value of the overlap matrix can be computed via the saddle point method, yielding 
a set of self-consistent equations for Q (a special case of (11261) . (11 271) ): 

Qab = (s a S b ) n , (12) 

where (•)„ denotes an average with respect to the Gibbs distribution P(s 1 , . . . ,s n ) = 

±e-f> H «, With H cS = -PEa b S a QaoS b . 

Now the physical meaning of the saddle point replica overlap matrix is explained 
in 18. 2[ it is simply related to the disorder averaged overlap distribution: 

(( Pj(q) »j = Inn — -1— £ 8(q - Qab), (13) 
n->o n(n - 1) 

where Pj(q) is given by ([S]). So the distribution of overlaps between pairs of free energy 
minima m" (weighted by their probability), is simply the distribution of off-diagonal 
matrix elements of the replica overlap matrix. Thus, in searching for solutions to ( fl2l . 
any ansatz about the structure of Q a b is implicitly an ansatz about the geometry and 
multiplicity of free energy valleys in (jSJ), averaged over J. 

Now the effective Hamiltonian yielding the average in ( Tl2l) is symmetric with respect 
to permutations of the replica indices a (i.e. permuting the rows and columns of Qab)- 
Therefore it is natural to search for a replica symmetric saddle point in which Q a b = q for 
all a 7^ b. This is equivalent to an assumption that there is only one free energy valley, 
and q measures its heterogeneity. Taking the n — >■ limit with this replica symmetric 
ansatz yields a saddle point equation for q (see ( 1142p for the derivation): 

g=((tanh 2 (/3^))) / (14) 

At high temperature (/3 < 1), q — is the only solution, representing a "paramagnetic" 
state (Fig. [TJA.) in which activity patterns fluctuate over all possible configurations, and 
average neural activity is for all i (Fig. HA). At lower temperature ((3 > 1), a 
nonzero solution rises continuously from 0, suggesting a phase transition to a "frozen" 
state corresponding to a single valley (Fig. [TJ3) in which each neuron has a different 
mean activity rrii. 

While this scenario seems plausible, a further analysis of this solution [61 162] yields 
inconsistent physical predictions (like negative entropy for the system). Within the 
replica framework, this inconsistency can be detected by showing that the replica 
symmetric saddle point for Q ab is unstable [63], and so one must search for solutions in 
which Q a b breaks the replica symmetry. This corresponds to a physical picture in which 
there are many free energy minima. A great deal of work has lead to a remarkably rich 
ansatz which predicts a nested hierarchical, tree like organization on the space of free 
energy minima (see Fig. fflCD), known as an ultrametric structure |64j. It is striking 
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that this highly symmetric and ordered low temperature hierarchical structure emerges 
generically from purely random, disordered couplings Jy. Unfortunately, we will not 
explore this phenomenon further here, since for most of the applications of replica theory 
to neuronal processing and data analysis discussed below, a replica symmetric analysis 
turns out to be correct. 

2.2. Chaos in the SK Model and the Hopfield Solution 

So far, in order to introduce the replica method, we have analyzed a toy neuronal 
network with a random symmetric connectivity matrix J, and found that such a network 
exhibits broken replica symmetry corresponding to a hierarchy of low energy states 
that are stable with respect to thermal or noise induced fluctuations. It is tempting 
to explore the possibility that this multiplicity of states may be useful for performing 
neural information processing tasks. However, several works have noted that while these 
states are stable with respect to thermal fluctuations, they are not structurally stable 
with respect to perturbations either to the inverse temperature (3, or the connectivity 
matrix J [651 1661 [67] . Indeed very small changes to or J induce macroscopic changes in 
the location of energy minima in the space of neuronal activity patterns. This sensitive 
dependence of low energy activity patterns to either (3 or J was called temperature 
or disorder chaos respectively in [66J. For neural information processing, it would be 
useful to instead have network connectivities whose noisy dynamics not only thermally 
stabilize a prescribed set of neuronal activity patterns, but do so in a manner that is 
structurally stable with respect to changes in either the connectivity or level of noise. 

An early proposal to do just this was the Hopfield model [68] • Suppose one wishes 
to find a network connectivity J that stabilizes a prescribed set of P iV-dimensional 
patterns £ M , for /i = 1, . . . , P, where £f = ±1. Hopfield's proposal was to choose 

j« = I e m- (is) 

This choice reflects the outcome of a Hebbian learning rule [69J in which each synapse 
from neuron j to neuron i changes its synaptic weight by an amount proportional to the 
correlation between the activity on its presynaptic and postsynaptic neuron. When the 
activity pattern £ M is imposed upon the network, this correlation is and when all 

P patterns are imposed upon the network in succession, the learned synaptic weights 
are given by (ITS"]) . 

This synaptic connectivity J induces an equilibrium probability distribution over 
neuronal activity patterns s through ([2]). Ideally, this distribution should have 2P free 
energy valleys, corresponding to lumps of probability mass located near the P patterns 
£ M and their reflections — If so, then when network activity s is initialized to either 
a corrupted or partial version of one of the learned patterns the network will relax 
(under a dynamics whose stationary distribution is given by (T5])) to the free energy 
valley corresponding to £ M . This relaxation process is often called pattern completion. 
Thus Hopfield's prescription provided a unifying framework for thinking about learning 
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and memory: the structure of past experience (i.e. the patterns £ M ) are learned, or 
stored, in the network's synaptic weights (i.e. through ffl5|) ). and subsequent network 
dynamics can be viewed as motion down a free energy landscape determined by the 
weights. If learning is successful, the minima of this free energy landscape correspond to 
past experiences, and the process of recalling past experience corresponds to completing 
partial or corrupted initial network activity patterns induced by current stimuli. 

A key issue then is storage capacity: how many patterns P can a network of iV 
neurons store? This issue was addressed in [70j E3] via the replica method in the 
situation where the stored patterns £ M are random and uncorrelated (each £f is chosen 
independently to be +1 or —1 with equal probability). These works extensively analyzed 
the properties of free energy valleys in the Gibbs distribution ([2]) with connectivity f jT5|) . 
as a function of the inverse temperature (3 and the level of storage saturation a = 
This problem fits the classic mold of disordered statistical physics, where the patterns 
£ M play the role of quenched disorder, and neuronal activity patterns play the role of 
thermal degrees of freedom. In particular, the structure of free energy minima can be 
described by a collection of self-averaging order parameters m M = jr £ M • s, denoting the 
overlap of neuronal activity with pattern /i. Successful pattern completion is possible if 
there are 2P free energy valleys such that the average of m fl in each valley is large for one 
pattern /i and small for all the rest. These free energy valleys can be thought of as recall 
states. The replica method in [701 EI] yields a set of self-consistent equations for these 
averages. Solutions to the replica equations, in which precisely one order parameter m} 1 
is large, are found at low temperature only when a < a c = 0.138. For a > a c , the 
system is in a spin glass state with many free energy minima, none of which have a 
macroscopic overlap with any of the patterns (in the solutions to the replica equations, 
no average is 0(1) as P, N — > oo with a > a c ). At such high levels of storage, so 
many patterns "confuse" the network, so that its low energy states do not look like any 
one pattern Indeed the free energy landscape of the Hopfield model as a becomes 
large behaves like the low temperature spin glass phase of the SK model discussed in 
the previous section. 

Even for a < a c , at low enough temperatures, spurious, metastable free energy 
valleys corresponding to mixtures of patterns can also arise. These mixture states are 
characterized by solutions to the replica equations in which the average m M is 0(1) for 
more than one fi. However, as the temperature is increased, such mixture states melt 
away. This phenomenon illustrates a beneficial role for noise in associative memory 
operation. However, there is a tradeoff to melting away mixture states by increasing 
temperature, as a c decreases with increasing temperature. Nevertheless, in summary 
there is a robust region in the a — (3 phase plane with a = O(0. 1) and (3 corresponding 
to low temperatures, in which the recall states dominate the free energy landscape 
over neural activity patterns, and the network can successfully operate as a pattern 
completion, or associative memory device. Many important details about the phase 
diagram of free energy valleys as a function of a and (3 and be found in [70j [71] . 
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2.3. Cavity Method 

We now return to an analysis of the SK model through an alternate method that sheds 
light on the physical meaning of the saddle point equation for the replica symmetric order 
parameter q in (Q3J), which may seem a bit obscure. In particular, we give an alternate 
derivation of (HM through the cavity method [7J [72] which provides considerable physical 
intuition for fTH|) by describing it as a self-consistency condition. In general, the cavity 
method, while indirect, can often provide intuition for the final results derived via more 
direct replica methods. 

The starting point involves noting that the SK Hamiltonian ([1]) governing the 
fluctuations of N neurons can be written as 

H N (8,J) = -Sihi + H v , (16) 

where 

N 

h 1 = J2j li s i (17) 

i=2 

is the local field acting on neuron 1, and 

1 N 

H V = _ 9 X, 3ii s i s v ( 18 ) 

1 ij=2 

is the Hamiltonian of the rest of the neurons S2, ■ ■ ■ , s^. Since hi is a sum of many 
terms, it is tempting to approximate its thermal fluctuations in the full system of N 
neurons in (flBT) by a Gaussian distribution. However, such a Gaussian approximation 
is generally invalid because the individual terms are correlated with each other. One 
source of correlation arises from a common coupling of all the neurons S2, . . . , sjv to s±. 
For example, because si interacts with Sj through the symmetric coupling Jij = Ja, 
whenever s\ = +1 (or s± = —1) this exerts a positive (or negative) effect on the 
combination JijSj. Thus all individual terms in (fTTj) exhibit correlated fluctuations 
due to common coupling to the fluctuating neuron si. 

The key idea behind the cavity method is to consider not the distribution of the 
local field hi acting on neuron 1 in the full system of N neurons in fflBI) . but instead the 
distribution of hi in a "cavity system" of iV — 1 neurons obtained by removing si from 
the system, thereby leaving "cavity" (see Fig. [2]\B). Then hi is known as the cavity 
field, or the field exerted on neuron 1 by all the others in the absence of neuron 1, and 
its distribution is given by that of hi ffTTj) in a Gibbs distribution with respect to ffT8]) ; 

1 N 
P\i(h) = — £ i^-^J^e"^ (19) 

^\l s 2 ,...,s N j=2 

The joint distribution of si and its local field hi in the full system of iV spins can be 
written in terms of the cavity field distribution as follows: 



1 



P N (s 1 ,hi) = — ]T S{hi-J2^i)^ H 
1 



N 



y 

s 2 ,...,s N i=2 



z e-^ M) P\i(hi), (20) 
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where V(si,hi) = —s\h\. 

The advantage of writing the joint distribution of s\ and hi in terms of the cavity 
field distribution P\i{h\) is that one can now plausibly make a Gaussian approximation 
to P\i(hi), i.e. the distribution of (IT7|) in the cavity system ( Fl8|) of neurons 2, . . . , N in 
the absence of 1. Because the cavity system does not couple to neuron 1, it does not know 
about the set of couplings Ju, and therefore the thermal fluctuations of cavity activity 



Figure 2. The cavity Method. (A) A network of neurons, or spins. (B) A cavity 
surrounding a single neuron, Si, that has been removed from the system. (C) In a 
replica symmetric approximation, the full distribution of the field h\ exerted on the 
cavity (in the absence of Si) by all other neurons can be approximated by a Gaussian 
distribution, while the joint distribution of si and h\ takes the form in equation (|20j) . 



patterns s 2 , ...,sjv, while of course correlated with each other, must be uncorrelated 
with the couplings Ju, unlike the case of these same fluctuations in the presence of 
Sj. Motivated by this lack of correlation, we can make a Gaussian approximation to 
the thermal fluctuations of hi in the cavity system H\\. Note that this does not imply 
that the local field hi in the full system Hn is Gaussian. Indeed, if P\i(hi) in ( |20l) 
is Gaussian, then P N (hi) obtained by marginalizing out s x in P N (si,hi) cannot be 
Gaussian; as discussed above, this non-gaussianity arises due to positive correlations 
between si and hi induced by their coupling V(si,hi). The simplification in replacing 
the network with a fluctuating field is shown in the transition from FigJ2B to ED. 
Under a Gaussian approximation, P\ 1 (/i 1 ) is characterized by its mean 



A 




C 



N 




(21) 



i=2 



and variance 



N 




(22) 



N 




(23) 




1 - 



(24) 
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where q is the order parameter 
1 N 

(25) 

i=l 

and 5sj = Si — (si)\i- Here we have neglected various terms that vanish in the large 
TV limit, but most importantly, in going from fl22|) to ff23|) . we have made a strong 
assumption that the connected correlation (SsiSsj)^ vanishes in the large N limit fast 
enough that we can neglect all off-diagonal terms in (J22J) . This can be true if the cavity 
system (and consequently the full system) is accurately described by a single free energy 
valley. On the otherhand, if the system is described by multiple free energy valleys, the 
connected correlation will receive contributions from fluctuations across valleys, and we 
cannot neglect the off-diagonal terms [7]. Thus the validity of this cavity approximation 
is tantamount to an assumption of replica symmetry, or a single valley in the free energy 
landscape. As discussed above, under the assumption of a single valley, we expect q to 
be self-averaging: it does not depend on the detailed realization of J^- in the large N 
limit. Finally, we note that the cavity method can be extended to scenarios in which 
replica symmetry is broken and there are multiple valleys [7]. 

In the replica symmetric scenario, under the Gaussian approximation to P\i(hi), 
( )20l) becomes 

Pn{si, h x ) = 7Uh . 1 - e-W^-^^M 2 , (26 ) 

Z[{hi) v ,l-q] 

allowing us to compute the mean activity of neuron % in the full system of iV neurons, 
in terms of the mean 1 and variance 1 — q of its cavity field, 

( Sl | (h^, l-q) N = J2 siPn(s u h x ). (27) 

si 

But now we must compute q, which we can do by demanding self consistency of the 
cavity approximation. First of all, we note that there was nothing special about neuron 
1; the above procedure of forming a cavity system by removing neuron 1 could have 
been done with any neuron. Thus f[2T)| |2"7|) holds individually for all neurons i, and we 
can average these equations over all % to obtain an expression for q: 

1 N 

?= vE^IMv 1 -^- (28) 
JV i=i 

However, we do not yet know (hi)^ for each i. For each i, (hi)^ = J2k=£i Jik{Sk)\i is 
a random variable due to the randomness of the couplings J^, which are uncorrelated 
with (sfc)y- by virtue of the fact that this thermal average occurs in a cavity system 
in the absence of i. Thus we expect the distribution of over random realizations 

of Jik to be gaussian, with a mean and variance that are easily computed to be and 
q respectively. Furthermore, we expect this distribution to be self-averaging: i.e. the 
distribution of (hi)u for a fixed i across different realizations of J should be the same 
as the distribution of across different neurons % for a fixed realization of J, in 

the large iV limit. Under this assumption, although we may not know each individual 
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we can replace the average over neurons in (1281) with an average over a Gaussian 
distribution, yielding 



Here ((•)) z denotes a "quenched" average with respect to a zero mean unit variance 
Gaussian variable z, reflecting the heterogeneity of the mean cavity field across neurons, 
and the thermal average (-) N is computed via (1261 1271) . and reflects the thermal 
fluctuations of a single neuron in the presence of a cavity field with mean and variance 



Equation f[2~9~|) is a self-consistent equation for the order parameter q which is itself 
a measure of the heterogeneity of mean activity across neurons. So physically, (j29|) 
reflects a demand that the statistical properties of the cavity fields are consistent with 
the heterogeneity of mean neural activity. Now finally, we can specialize to the SK 
model in which V(s, h) = —sh in ( 1261) . which yields (si | y/qz, 1 — q) N = ta.nh.f3y/qz in 
( 127|) . and when this is substituted into ( 129|) . we recover the self-consistent equation for 
q in (THj) . derived via the replica method. 

2-4- Message Passing 

So far, we have seen two methods which allow us to calculate self- averaging quantities 
(for example q = ( s i) 2 ) that do not depend on the detailed realization J. However, 
we may wish to understand the detailed pattern of mean neural activity, i.e. (sj) for 
all i, for some fixed realization of Jij. Mathematically, this corresponds to computing 
the marginal distribution of a single neuron in a full joint distribution given by ([2]). 
Here we introduce efficient distributed message passing algorithms from computer 
science [HI [IH [10] that have been developed to compute such marginals in probability 
distributions which obey certain factorization properties. 

Consider for example a joint distribution over N variables xi, ■ ■ • ,xn that factorizes 
into a set of P factors, or interactions, indexed by a = 1, ... P: 



Here X{ is any arbitrary variable that could be either continuous or discrete, and x a 
denotes the collection of variables that factor a depends on. Thus we systematically 
abuse notation and think of each factor index a also as a subset of the iV variables, with 
variable i G a if and only if factor ip a depends on X{. The factorization properties of P 
can be visualized in a factor graph, which is a bipartite graph whose nodes correspond 
either to variables % or factors a, and there is an edge between a factor a and variable i 
if and only if i e a, or equivalently factor i\) a depends on x% (see Fig. [3JA.) . For example, 
the SK model, or more generally any neural system with an equilibrium distribution, 
corresponds to a factor graph in which the neurons s, are the variables x^, and the 
factors correspond to nonzero synaptic weights connecting pairs of neurons. Thus 
each a corresponds to a neuron pair a = (ij), and in the SK model of equation 




(29) 



yfqz and 1 — q, respectively. 




a=l 



(30) 




o 

Figure 3. Message Passing. (A) A factor graph in which variable nodes are 
represented by circles, and factor nodes are represented by squares. (B) The flow 
of messages involved in the update of the message M^ a (xi). (C) The message passing 
approximation to the joint distribution of Xi,Xj, and x^. Here the interaction a is 
treated exactly, while the effect of all other interactions besides a are approximated 
by a product of messages. (D) Exact message passing in a chain; the marginal on xt 
is computed exactly as a product of two messages. 



The utility of the factor graph representation is that an iterative algorithm to 
compute the marginals, 

P{Xi)= £ P{ Xl ,...,X N ) (31) 

for all i can be visualized as the flow of messages along the factor graph (j3]B). We first 
define this iterative algorithm and then later give justification for it. Every message is a 
probability distribution over a single variable, and at any given time t there are two types 
of messages, one from variables to factors, and the other from factors to variables. We 
denote by Mj^ b (xj) the message from variable j to factor b, and by Ml_ H (xi) the message 
from factor b to variable i. Intuitively, we can think of Mj^ b (xj) as an approximation 
to distribution on Xj induced by all other interactions besides interaction b. In contrast, 
we can think of M|_^( %i J SbS sin approximation to the distribution on X{ induced by the 
direct influence of interaction b alone. These messages will be used below to approximate 
the marginal of Xi in the full joint distribution of all interactions (see e.g. ( 1341) ). 

The (unnormalized) update equation for a factor to variable message is given by 

MlX\{x t ) = J2M%b) II ( 32 ) 

x b\i 

where b\i denotes the set of all variables connected to factor node b except i (see Fig. 
[3b). Intuitively, the direct influence of b alone on i (the left hand side of (132]) ) is 
obtained by marginalizing out all variables other than i in the factor ipj,, supplemented 
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by accounting for the effect of all of the other interactions besides b on variables j G b\i 
by the product of messages Mj^ b (xj) (see Fig. [3b). The (unnormalized) update equation 
for the variable to factor messages is then given by 

M&fa) = II (33) 

b£i\a 

Intuitively, the distribution on Xi induced by all other interactions besides interaction a 
(the left hand side of ( 133]) ) is simply the product of the direct influences of all interactions 
b that involve variable i, except for interaction a (see Fig. [3b). Message passing involves 
randomly initializing all the messages and then iteratively running the update equations 
f [32|33|) until convergence. One exception to the random initialization is the situation 
where any variable i is connected to only 1 variable node a. In this case, Mi^ a (xi) is 
initialized to be a uniform distribution over Xj, since in the absence of a, variable i feels 
no influence from the rest of the graph. Under the message passing dynamics, Mj^ a (xj) 
will remain a uniform distribution. Now for general factor graphs, convergence is not 
guaranteed, but if the algorithm does converge, then the marginal distribution of a 
variable be approximated via 

P^ocn^W- (34) 
and indeed the the joint distribution of all variables i G a can be approximated via 

P(x a ) cx^ a (x a )n M "a(^)- (35) 

The update equations (I32|l32j) . while intuitive, lead to two natural questions: for 
which factor graphs will they converge, and if they converge, how well will the fixed point 
messages and M™ a approximate the true marginals though equations H34|I35 j) ? A 

key intuition arises from the structure of the approximation to the joint marginal of the 
variables x a in ([35]) (see also Fig. |3p). This approximation treats the coupling of the 
variables % G a through interaction a by explicitly including the factor ip a . However, 
it approximates the effects of all other interactions b on these variables by a simple 
product of messages M°^ a (xi). An exactly analogous approximation is made in the 
update equation ( 132]) . Such approximations might be expected to work well whenever 
removing the interaction a leads to a factor graph in which all the variables % that were 
previously connected to a are now weakly coupled (ideally independent) under all the 
remaining interactions b ^ a. 

This weak coupling assumption under the removal of a single interaction holds 
exactly whenever the factor graph is a tree, with no loops. Indeed in such a case, 
removing any one interaction a removes all paths through the factor graph between 
variables i G a. In the absence of any such paths, all pairs of variables i G a are 
independent, and their joint distribution factorizes, consistent with the approximations 
made in ( 132]) and ( 135]) . In general, whenever the factor graph is a tree, the message 
passing equations converge in finite time, and the fixed point messages yield the true 
marginals [TTJ. We will not give a general proof of this fact, but we will illustrate it in the 
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case of a one dimensional Ising chain (see Fig. ED)- Consider the marginal distribution 
of a spin at position i in the chain. This spin feels an interaction to its left and right, 
and so ( |35i) tells us the marginal is a product of two converged messages at time t = oo: 

P( Sl ) oc M$_ lt{Hi {8i) M% i+1) ^( Si ), (36) 

Each of these two messages can be computed by iterating messages from either end of 
the chain to position i. For example, the rightward iteration for computing M^Lj^w^Sj) 
is 

where the first equality is a special case of (1331) and the second is a special case of (|32|) . 
The first message in this iteration, M°_>.(i 2 )(si) is initialized to be a uniform distribution, 
since spin 1 is only connected to a single interaction (1,2). A similar leftward iteration 
leads to the calculation of M^ i+1 \_^(si). Each iteration converges in an amount of time 
given by the path length from each corresponding end to i, and after convergence, we 
have 

M%_ MH ,( Si ) = E e^riW^+x (38) 

Sl,...,Sj_l 

M5W*(*)= E e^tt-W^. (39) 

Sj + i,...,sjv 

Inserting ( 1381) and ( |39l) into ( 1361) yields the correct marginal for s», and the normalization 
factor can be fixed at the end by demanding P(+l)+P(— 1) = 1. Note that whereas 
a naive sum over all spin configurations to compute the marginal over Sj would require 
0(2^) operations, this iterative procedure for computing the marginal requires only 
0(N) operations. Moreover, two sweeps through the chain allow us to compute all the 
messages, and therefore all N marginals simultaneously, as ( 136]) holds for all i. Overall, 
this method is essentially identical to the transfer matrix method for the ID Ising chain, 
and is a generalization of the Bethe approximation [73] . 

Although message passing is only exact on trees, it can nevertheless be applied to 
graphical models with loops, and as discussed above, it should yield good approximate 
marginals whenever the variables adjacent to a factor node are weakly correlated upon 
removal of that factor node. We will see successful examples of this in the contexts of 
learning in section 13.61 and compressed sensing in 16.31 An early theoretical advance in 
partially justifying the application of message passing to graphical models with loops 
was a variational connection: each solution to the fixed point equations of message 
passing are in one to one correspondence with extrema of a certain Bethe free energy 
[71], an approximation to the Gibbs free energy in variational approaches to inference 
in graphical models that is exact on trees (see [75] for a review). However there are 
no known general and precise conditions under which message passing in graphical 
models with many loops is theoretically guaranteed to converge to messages that yield a 
good approximation to the marginals. Nevertheless, in practice, message passing seems 
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to achieve empirical success in approximating marginals when correlations between 
variables adjacent to a factor node are indeed weak after removal of that factor. 

We conclude this section by connecting message passing back to the replica method. 
In general, suitable averages of the messaging passing equations reduce to both the 
cavity equations and the replica equations [5]. To illustrate this in the special case of 
the SK model, we outline the derivation of the replica saddle point equation (j!4p from 
the perspective of message passing. We first note that since every factor node in the 
SK model has degree 2, the update of a message from a factor to variable j, i.e. 
M(jj)_>j(sj) depends only on the message M^^j^Si) through, 

M £W s i) = E ^ JiiSiSi (4°) 



which is a special case of (1321) . Thus we can take one set of messages, for example 
the node to factor messages, Mf^^Jsi), as the essential degrees of freedom upon 
which the message passing dynamics operates. We simplify notation a little by letting 
M^j(si) = Mj_^u Asi) . Then the remaining message passing update (133]) yields the 
dynamics 

Mt^M) = II E e^ s *MU(s k ), (41) 

kei\j s k 

where k G i if and only if Sk is coupled to Si through a nonzero Jjfc. 

Now each message is a distribution over a binary variable, and all such distributions 
can be usefully parameterized by a single scalar parameter: 

MU^Si) oc e ^ s \ (42) 

Here the scalar parameter h\_+j can be thought of as a type of cavity field; as t — > oo, 
if message passing is successful, converges to the field exerted on spin i by all 

the spins in a cavity system in which the interaction is removed. In terms of this 
parameterization, the message passing updates (jUJ) yield a dynamical system on the 
cavity fields [76j, 

h$%= £ uiJ^hU). (43) 

kei\j 

Here u(J,h) is defined implicitly through the relation 

0u( j,h)s ^ ePJss'+phs' ^ ( 44 ) 



v ' ' oc 

Physically u(J,h) is the effective field on a binary spin s coupled with strength J to 
another spin s' that experiences an external field of strength h, after marginalizing out 
s'. Explicitly, 

u(J, h) = — arctanh [tanh(/3 J) tanh(/3/i)] . (45) 

In the weak coupling limit of small J, u(J,h) w Jtanh(/3h), which reflects the simple 
approximation that the average magnetization of s', due to the external field h (which 
would be tanh(/3/i) if the back-reaction from s can be neglected), exerts a field Jtanh(/3h) 
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on s. The more complex form of u(J, h) in (145 j) reflects the back-reaction of s on s' that 
becomes non-negligible at larger values of the bi-directional coupling J. In f|43|) . the 
updated cavity field turns out to be a simple sum over all the spins k (besides j) 
of this same effective field u obtained by marginalizing out s k in the presence of its own 
cavity field 

Using (|43p . we are now ready to derive f|T4|) . The key point is to consider self- 
consistency conditions for the distribution of cavity fields h\^- as t — > oo. We can 
think of this distribution in two ways. First, for a fixed i and j, h^j is a random 
variable due to the random choice of couplings J. Second, for a fixed realization of 
J, at a message passing fixed point, there is an empirical distribution of cavity fields 
hf^j across all choices of pairs i and j. The assumption of self-averaging means that 
as N — > oo, the latter empirical distribution converges to the distribution of the former 
random variable. In any case, we would like to write down a self-consistent equation 
for this distribution, by observing that this distribution must be self-reproducing under 
the update equation (fl3j) . More precisely, in (H3l) . if the couplings Jo- are drawn i.i.d 
from a distribution P{J), and the cavity fields /?4-»i are drawn i.i.d. from a distribution 
Q(h), then the induced distribution on ht^j should be identical to Q(h). This yields a 
recursive distributional equation characterizing the distribution of cavity fields Q(h) at 
a message passing fixed point: 



Q(h) = I n dJ k P(J k ) II dh k Q(h k ) 5[h-Y, u(J k , h k )) . (46) 

k k \ k J 



Here we have suppressed the arbitrary indices i and j. More generally, one can track 
the time-dependent evolution of the distribution of cavity fields, an algorithmic analysis 
technique known as density evolution [5]. 

In general, it can be difficult to solve the distributional equation for Q(h). However, 
in the SK model, one could make an approximation that the distribution of cavity fields 
is a zero mean Gaussian with variance q. Then the distributional equation for Q(h) 
reduces to a self-consistency condition for q by taking the expectation of h 2 on each 
side of ( 146]) . The left hand side is by definition q. To simplify the right hand side, 
since the couplings J k have variance of we can use the small coupling approximation 
u(J, h) m Jtanh(/3/i). Then averaging h 2 on both sides of (146!) yields 

q= f II dJ k P{Jk) II dh k Q(hk) f E J * tanh P h >) (47) 

h k \ k ) 

II dJ k P(J fc ) I] dh k Q{h k ) (E Jfe tanh 2 f3h k ) (48) 

k k \ k ) 

II dh k Q(h k ) ^1 E tanh 2 /3h k ^j (49) 

dhQ(h) tanh 2 /3/i. (50) 

Now, since we have assumed Q(h) is zero mean Gaussian with variance q, this is 
equivalent to the replica symmetric saddle point equation ( 1T4"]) . 
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In summary, we have employed a toy model of a neural network, the SK spin 
glass model, to introduce the various replica, cavity and message passing approaches to 
analyzing disordered statistical mechanical systems. In each case we have discussed in 
detail the simplest possible ansatz concerning the structure of free energy landscape, 
namely the replica symmetric ansatz, corresponding to a single valley with weak 
connected correlations between degrees of freedom. While this assumption is not true 
for the SK model, it nevertheless provides a good example system in which to gain 
familiarity with the various methods. And fortunately, for many of the applications 
discussed below, the assumption of a single free energy valley governing fluctuations 
will turn out to be correct. Finally, we note that just as the replica and cavity methods 
can be extended [7J to scenarios in which replica symmetry is broken, corresponding 
to many free energy valleys and long range correlations, so too can message passing 
approaches. Indeed viewing optimization and inference problems through the lens of 
statistical physics has lead to a new algorithm, known as survey propagation [77J [78] , 
which can find good marginals, or minimize costs, in free energy landscapes characterized 
by many metastable minima that can confound more traditional, local algorithms. 

3. Statistical Mechanics of Learning 

In the above sections, we have reviewed powerful machinery designed to understand the 
statistical mechanics of fluctuating neural activity patterns in the presence of disordered 
synaptic connectivity matrices. A key conceptual advance made by Gardner [791 EQ] was 
that this same machinery could be applied to the analysis of learning, by performing 
statistical mechanics directly on the space of synaptic connectivities, with the training 
examples presented to the system playing the role of quenched disorder. In this section 
we will explore this viewpoint and its applications to diverse phenomena in neural and 
unsupervised learning (see [12] for an extensive review of this topic). 

3.1. Perceptron Learning 

The perceptron is a simple neuronal model defined by a vector of N synaptic weights w, 
which linearly sums a pattern of incoming activity £, and fires depending on whether 
or not the summed input is above a threshold. Mathematically, in the case of zero 
threshold, it computes the function a = sgn(w-£), where a = +1 represents the firing 
state and a = — 1 represents the quiescent state. Geometrically, it separates its input 
space into two classes, each on opposite sides of the N — 1 dimensional hyperplane 
orthogonal to the weight vector w. Since the absolute scale of the weight vector w is 
not relevant to the problem, we will normalize the weights to satisfy w ■ w = N, so that 
the set of perceptrons live on an N — 1 dimensional sphere. 

Suppose we wish to train a perceptron to memorize a desired set of P input-output 
associations, £ M — > cr M . Doing so requires a learning rule (an algorithm for modifying 
the synaptic weights w based on the inputs and outputs) that finds a set of synaptic 
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weights w that satisfies the P inequalities 

w-^">0 V fi = l,...,P. (51) 

We will see below, that as long as there exists a simultaneous solution w to the P 
inequalities, then remarkably, a learning rule, known as the perceptron learning rule 
[T3] , can find the solution. The main remaining question is then, under what conditions 
on the training data cr^} does a solution to the inequalities exist? 

A statistical mechanics based approach to answering this question involves defining 
an energy function on the N — 1 dimensional sphere of perceptrons as follows, 

E(w) = J2V(\n, (52) 

where A M = ■ o" M £ M is the alignment of example /i with the weight vector w. 

Successfully memorizing all the patterns requires all alignments to be positive, so V 
should be a potential that penalizes negative alignments and favors positive ones. Indeed 
a wide variety of learning algorithms for the perceptron architecture can be formulated 
as gradient descent on E(w) for various choices of potential functions V(A) in (1521) [12J. 
However, if we are interested in probing the space of solutions to the inequalities ([ST]) , 
it is useful to take V(X) = 8(—X), where 9{x) is the Heaviside function (9(x) — l,x > 0, 
and otherwise). With this choice, the energy function in ( 1521) simply counts the number 
of misclassified examples, and so the Gibbs distribution 

P( W ) = I e -mw) (53) 

Zj 

in the zero temperature (/3 — >■ oo) limit becomes a uniform distribution on the space of 
perceptrons satisfying (l5Tj) (see Fig. HJ). Thus the volume of the space of solutions to 




Figure 4. Perceptron Learning. (A) The total sphere of all perceptron weights (grey 
circle) and a single example (black arrow). The blue region is the set of perceptron 
weights that yield an output +1 on the example. (B) The same as (A), but for a 
different example. (C) The set of weights that yield +1 on both examples in (A) and 
(B). (D) As more examples are added, the space of correct weights shrinks, and its 
typical volume is given by 1 — g, where q is the replica order parameter introduced in 
section 13.31 

( IBTT) . and in particular, whether or not it is nonzero, can be computed by analyzing the 
statistical mechanics of ( |53i) in the zero temperature limit. 
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3.2. Unsupervised Learning 

This same statistical mechanics formulation can be extended to more general 
unsupervised learning scenarios. In unsupervised learning, one often starts with a set 
of P data vectors for \i = 1, . . . , P, where each vector is of dimension N. For 
example, each vector could be a pattern of expression of N genes across P experimental 
conditions, or a pattern of activity of iV neurons in response to P stimuli. The overall 
goal of unsupervised learning is to find simple hidden structures or patterns in the data. 
The simplest approach is to find an interesting single dimension spanned by the vector 
w, such that the projections A M = -^Lw-^ of the data onto this single dimension yield 
a useful one dimensional coordinate system for the data. This interesting dimension 
can often be defined by minimizing the energy function (1521) . with the choice of ^(A) 
determining the particular unsupervised learning algorithm. One choice, V(X) = —X 
corresponds to Hebbian learning. Upon minimization of E(w), this choice leads to 
w oc Z)uC M i i- e - w points in the direction of the center of mass of the data. In situations 
in which the data has its center of mass at the origin, a useful choice is V(X) = —A 2 . 
Under this choice, w points in the direction of the eigenvector of maximal eigenvalue of 
the data covariance matrix C = • This is the direction of maximal variance in 

the data, also known as the first principal component of the data; i.e. it is the direction 
which maximizes the variance of the distribution of A M across data points \x. 

Beyond finding an interesting dimension in the data, another unsupervised learning 
task to find clusters in the data. A popular algorithm for doing so is .fT-means clustering. 
This is an iterative algorithm in which one maintains a guess about K potential cluster 
centroids in the data, w 1; . . . , w^-. At each iteration in the algorithm, each cluster i is 
defined to be the set of data points closer to centroid than to any other centroid. 
Then all the cluster centroids w, are optimized by minimizing the sum of the distances 
from Wj to those data points £ M assigned to cluster i. In the case where the distance 
measure is euclidean distance, this step just sets each centroid Wj to be the center of 
mass of the data points assigned to cluster i. The cluster assignments of the data are 
then recomputed with the new centroids, and the whole process repeats. The idea is 
that if there are K well separated clusters in the data, this iterative procedure should 
converge so that each Wj is the center of mass of cluster i. 

For general K , this iterative procedure can be viewed as an alternating minimization 
of a joint energy function over cluster centroids and cluster membership assignments. For 
the special case of K = 2, and when both the data and cluster centroids are normalized 
to have norm N, this energy function can be written as 



p 




(54) 



where 




i 



(55) 
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and, 



V(Ai,A 2 ) 



-A x 0(Ai - A 2 ) - \ 2 6{\ 2 - Ai) 
|(Ai + A 2 ) - -|Ai - A 2 |. 



(56) 
(57) 



Gradient descent on this energy function forces each centroid Wj to perform Hebbian 
learning only on the data points that are currently closest to it. 

3.3. Replica Analysis of Learning 

Both perceptron learning and unsupervised learning, when formulated as statistical 
mechanics problems as above, can be analyzed through the replica method. A natural 
question for perceptron learning is how many associations P can a perceptron with N 
synapses memorize? One benchmark is the case of random associations where ^ is a 
random vector drawn from a uniform distribution on a sphere of radius N and a u = ±1 
each with probability half. Similarly, a natural question for unsupervised learning, is 
how do we assess the statistical significance of any structure or pattern we find in a high 
dimensional dataset consisting of P points in N dimensions? To address this question 
it is often useful to analyze what structure we may find in a null data distribution that 
itself has no structure, for example when the data points ^ are drawn from a uniform 
distribution on the N — 1 sphere (or equivalently, in the large N limit, from a Gaussian 
distribution with identity covariance matrix). 

In both cases, the analysis simplifies in the "thermodynamic" limit P, N — > oo 
with the ratio a = P/N held constant. Fortunately, this is the limit of relevance to 
neural models with many synaptic weights, and to high dimensional data. The starting 
point of the analysis involves understanding the low energy configurations of the Gibbs 
distribution in (1531) . In the thermodynamic limit, important observables, like the volume 
of low energy configurations or the distribution of data along the optimal direction(s) 
become self-averaging; they do not depend on the detailed realization of £ M or <r M . 
Therefore we can compute these observables by averaging log Z over these realizations. 
This can be done by first averaging the replicated partition function 



where A^ = ^=w a -£ M . (For the case of perceptron learning, we can make the redefinition 
<t m £ M — since both have the same distribution; in essence we absorb the sign of the 
desired output into the input yielding only positive examples.) Averaging over £ M then 
reduces to averaging over A^. These variables are jointly Gaussian distributed with 
zero mean and covariance matrix ((A^A^)) = Q a b^v where Q a b = ^w a ■ w b is the 
replica overlap matrix. Thus after averaging over A^, the integrand depends on the 
configuration of replicated weights only through their overlap. Therefore it is useful 
to separate the integral over w a into an integral over all possible overlaps Q ao , and an 
integral over all configurations with the same overlap. Following section [HI this yields 




(58) 




ab 



(59) 
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where 

E(Q) = -a In / TT —L^e-I^^E^^) (60) 



0=1 



and 



S(Q) = ^Tr log Q (61) 

is the entropy of the volume of weight vectors with overlap matrix Q. For example, for 
perceptron learning when V(A) = B{— A), in the zero temperature limit /3 — >• oo, is 
an energetic term that promotes the alignment of the replicated weights so that they all 
yield the correct answer on any given set of examples (i.e. A a > for all a), while S(Q) 
is an entropic term that promotes replicated weight configurations with small overlaps, 
since they have larger volume. 

At large N the integral over Q a b can be done via the saddle point method, and 
the competition between entropy and energy selects a saddle point overlap matrix. We 
make the ansatz that the saddle point has a replica symmetric form Q a b = {I — q)5 a b + q. 
Given the connection (explained in section |8T2|) between replica overlap matrix elements, 
and the distribution of overlaps of pairs of random weights w drawn independently from 
( 153|) . this choice suggests the existence of a single free energy valley. This is reasonable 
to expect as most of the energy functions we will be analyzing for unsupervised learning 
are convex. Also, in the zero temperature limit, this ansatz suggests that the space 
of ground state energy configurations, if degenerate, should form a convex, connected 
set. This is indeed true for perceptron learning, since the space of ground states is the 
intersection of a set of P half-spheres (see Fig. Thus unlike the SK model, we expect 
a replica symmetric assumption to be a good approximation. 

Taking the n — > limit yields a saddle point equation for q which, as explained in 
section 18.3.21 can be derived from extremizing a free energy 



F( q ) = a{{ln()) z +~ 



'' + ln(l - q) 



l-q 



(62) 



C = / , dX e^^^W (63) 
' /2tt(1 - q) 



where 



is the partition function of the distribution appearing inside the average over z in (j!55p . 
Now in the case of perceptron learning, 1 — q reflects the typical volume of the solution 
space to (loTj) (see Fig. HP), in that q in the j3 — > oo limit is the typical overlap 
between two zero energy synaptic weight configurations (see section [872]) . q arises from 
a minimization of the sum of two terms in F(q) in ( 1621) . The first term is an energetic 
term which is a decreasing function of q, reflecting a pressure for synaptic weights to 
agree on all examples (promoting larger q). The second term is an entropic term that is 
an increasing function of q, which thus promotes smaller values of q which reflect larger 
volumes in weight space. As a increases, placing greater weight on the first term in 
F(q), q increases as energy becomes more important than entropy. As shown in [80J, 
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q — » 1 as a — > a c = 2. Thus a perceptron with N weights can store at most 2N random 
associations. 

3.4- Perceptrons and Purkinje Cells in the Cerebellum 

Interestingly, in [HI] the authors developed a replica based analysis of perceptron 
learning and applied it to make predictions about the distribution of synaptic weights 
of Purkinje cells in the cerebellum. Indeed an analogy between the Purkinje cell and 
the perceptron was first posited over 40 years ago [821 183]. The Purkinje cell has one of 
the largest and most intricate dendritic arbors of all neuronal cell types; this arbor is 
capable of receiving excitatory synaptic inputs from about 100,000 granule cells which, 
in areas of the cerebellum devoted to motor control, convey a sparse representation of 
ongoing internal motor states, sensory feedback, and contextual states. The Purkinje 
cell output in turn can exert an influence on outgoing motor control signals. In addition 
to the granule cell input, each Purkinje cell receives input from on average one cell in the 
inferior olive through a climbing fiber input, whose firing induces large complex spikes in 
the Purkinje cell as well as plasticity in the granule cell to Purkinje cell synapses. Since 
inferior olive firing is often correlated with errors in motor tasks, climbing fiber input 
is thought to convey an error signal that can guide plasticity. Thus at a qualitative 
level, the Purkinje cell can be thought of as performing supervised learning in order to 
map ongoing task related inputs to desired motor outputs, where the desired mapping is 
learned over time using error corrective signals transmitted through the climbing fibers. 

Now the actual distribution of synaptic weights between granule cells and Purkinje 
cells has been measured [M], and a prominent feature of this distribution is that it has 
a delta function at 0, while the rest of the nonzero weights follow a truncated Gaussian 
distribution. In particular about 80 percent of the synaptic weights are exactly 0. If 
the Purkinje cell is implementing an important sensorimotor mapping, why then are a 
majority of the synapses silent? In general, the distribution of synaptic weights in a 
network should reflect the properties of the learning rule as well as the statistics of inputs 
and outputs. Thus one might be able to quantitatively derive the distribution of weights 
by positing a particular learning rule and input-output statistics and then derive the 
weight distribution. However, the authors of [HI] took an even more elegant approach 
that did not depend on even positing any particular learning rule. They simply modeled 
the Purkinje cell architecture as a perceptron, assumed that it operated optimally at 
capacity, and derived the distribution of synaptic weights of perceptrons operating at 
capacity via a replica based Gardner type analysis. Remarkably, for a wide range of 
input-output statistics, whenever the perceptron implemented the maximal number of 
input-output associations at a given level of reliability (its capacity), its distribution 
of synaptic weights consisted of a delta-function at plus a truncated Gaussian for 
the nonzero weights. Indeed, like the data, a majority of the synapses were silent. 
This prediction only relies on the perceptron operating at (or near) capacity, and does 
not depend on the learning rule; any learning rule that can achieve capacity would 
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necessarily yield such a weight distribution. 

The key intuition for why a majority of the synapses are silent comes from the 
constraint that all the granule cell to Purkinje cell synapses are excitatory. Thus the 
either the Purkinje cell or the perceptron faces a difficult computational task: it must 
find a nonnegative synaptic weight vector that linearly combines nonnegative granule 
cell activity patterns and fires for some fraction of granule cell patterns while not firing 
for the rest. It turns out that false positive errors dominate the weight structure of the 
optimal perceptron operating at or near capacity: there are many granule cell activation 
patterns for which the perceptron must remain below threshold and not fire, and the 
only way to achieve this requirement with nonnegative weights is to set many synapses 
exactly to zero. Indeed by quantitatively matching the parameters of the replica based 
perceptron learning theory to physiological data, the capacity of the generic Purkinje 
cell was estimated to be about 40,000 input-output associations, corresponding to 5 
kilobytes of information stored in the weights of a single cell [81] . 



3. 5. Illusions of Structure in High Dimensional Noise 

In contrast to perceptron learning, in the applications of the statistical mechanics 
formulation in f )52|) and f )53|) to unsupervised learning discussed here, ^(A) has a unique 
minimum leading to a non-degenerate ground state. Thus in the zero temperature 
/3 — > oo limit we expect thermal fluctuations in the synaptic weights, reflected by 1 — q, 
to vanish. Indeed we can find self consistent solutions to the extremization of F(q) in 
(1621) by assuming 1 — g = j as /3 — >■ oo with A remaining 0(1). In this limit, (l6"2j) and 
f )63|) become 

±F(A) = -«(( min A [^f + V(A)] ))^ + ^- (64) 

Extremization of (164")) over A determines the value of A as a function of a. 

Furthermore, the interesting observable for unsupervised learning is the distribution 
of alignments across examples with the optimal weight vector, 

^(A) = ^£<*(A-A"), (65) 

where = -^=w-£ M , and w minimizes E(w) in ( l52i) . This is essentially the 
distribution of the data £ M along the dimension discovered by unsupervised learning. 
This distribution is derived via the replica method in section 18741 at finite temperature, 
and is given by (11551) . Its zero temperature limit yields 

P(A) = «<5(A-A*(z,A)» 2 , (66) 

where 

\\-zf 



X*(z, A) = argmin A 



+ V(X) 



(67) 



2A 

Equations (I64")) . (1661) and (1671) have a simple interpretation within the zero 
temperature cavity method applied to unsupervised learning [85, 86J. Consider a cavity 



CONTENTS 



31 



system in which one of the examples, say example 1 in (152 p is removed, and let w^ 1 be 
the "cavity" weight vector which optimizes E(w) in the presence of all other examples 
£ M for /i = 2, . . . , P. Since w^ 1 does not know about about the random example £ 1 , its 
overlap with this example, z = ^pw^'^ 1 is a zero mean unit variance random gaussian 
variable. Now suppose example 1 is then included in the unsupervised learning problem. 
Then upon re- minimization of the total energy E(w) in the presence of the weight 
vector w^ 1 will change to a new weight vector, and consequently its alignment with £ x 
will also change from z to an optimal alignment A*. It can be shown [86] that for large 
N and P, this new optimal alignment arises through the minimization in (I67|) . This 
minimization reflects a competition between two effects; the second term in ( 1671) favors 
optimizing the alignment with respect to the new example, but the first term tries to 
prevent changes from the original alignment z. This term arises because w was already 
optimal with respect to all the other examples, so any changes in w incur an energy 
penalty with respect to the old examples. The parameter A plays the role of an inverse 

A Random Projection B Hcbbian Learning 
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Figure 5. Illusions of structure. P — 2000 random points in N = 1000 dimensional 
space (so a = 2) are drawn from a structureless zero mean, identity covariance Gaussian 
distribution. These points are projected onto different directions. (A) A histogram of 
the projection of these points onto a random direction; in the large N, P limit this 
histogram is Gaussian with mean and unit variance. (B) A histogram of the same 
point cloud projected onto the Hebbian weight vector. (C) A projection onto the 
principal component vector. (D) The same point cloud projected onto two clusters 
directions found by if -means clustering with K = 2. 
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stiffness constant which determines the scale of a possible realignment of a weight vector 
with respect to a new example, and a self-consistency condition for A can be derived 
within the cavity approximation and is identical to the extremization of ( |64l) . This 
extremization makes A implicitly a function of a, and it is usually a decreasing function 
of a. Thus unsupervised learning becomes stiffer as a, or the number of examples, 
increases and the weight vector responds less to the presentation of any new example. 
Finally, example 1 is not special in any way. Thus repeating this analysis for each 
example, and averaging over the gaussian distribution of alignments z before learning 
any example, yields the distribution of alignments across examples after learning in ( |66|) . 

We can now apply these results to an analysis of illusions of structure in high 
dimensional data. Consider an unstructured dataset, i.e. a random gaussian point 
cloud consisting of P points, £ x , . . . ,£ p in N dimensional space, where each point ^ 
is drawn i.i.d from a zero mean multivariate gaussian distribution whose covariance 
matrix is the identity matrix. Thus if we project this data onto a random direction w, 
the distribution of this projection A M = ^f w '£'' across examples fi will be a zero mean 
unit variance gaussian (see Fig. [5A). However, suppose we performed Hebbian learning 
to find the center of mass of the data. This corresponds to the choice V^(A) = —A, 
and leads to A* (z, A) = z + A from (167|) with A = from Thus Hebbian 

learning yields an additive shift in the alignment to a new example whose magnitude 
decreases with the number of previous examples as After learning, we find that the 
distribution of alignments in f )66|) is a unit variance gaussian with a nonzero mean given 
by (see Fig. EB). Thus a high dimensional random gaussian point cloud typically 
has a nonzero center of mass when projected onto the optimal Hebbian weight vector. 

Similarly, we could perform principal components analysis to find the direction of 
maximal variance in the data. This corresponds to the choice ^(A) = —A 2 and leads 
through dHTJ) and flMD to X*(z, A(a)) = (1 + -j^)z. Thus PCA scales up the alignment 
to a new example, and flBB]) leads to a gaussian distribution of alignments along the 
principal component with zero mean, but a standard deviation equal to 1 + (see Fig 
ED). This extra width is larger than any unity eigenvalue of the covariance matrix and 
leads to an illusion that the high dimensional gaussian point cloud has a large width 
along the principal component direction. 

Finally, K-means clustering for K = 2, defined by the energy function in ( 1561) . 
involves a projection of the data onto two dimensions, determined by the two cluster 
centroids. However, the form of this energy function in ( 1571) reveals a lack of interaction 
between the projected coordinates A + = Ai + A 2 and A_ = Ai + A 2 . Along the direction 
A + , the algorithm behaves like Hebbian learning, so we should expected a gaussian 
distribution of the data along Ai + A 2 with a mean of However, along Ai — A 2 
the algorithm is maximizing the absolute value of the projection, so that V(X) = — |A|. 
With this choice, (loTI) yields \*(z, A) = z + sgn(2;)A with A = ^ determined by 
( |64|) . Note that this implies that the distribution of alignments in f )66|) has a gap of 
zero density in the region — -7= < A < -7= and outside this region, the distribution is 
a split gaussian. The joint distribution of high dimensional data in K-means clustering 
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factorizes along Ai + A 2 and Ai — A 2 and does indeed have a gap of width along 
the Ai — A 2 direction (see Fig. [5p). So quite remarkably, K- means clustering (with 
K = 2) of a random high dimensional gaussian point cloud reveals the illusion that 
there are 2 well separated clusters in the cloud. There is not a perfect match between 
the replica symmetric theory and numerical experiments because the discontinuity in 
the derivative of the energy in fl57|) actually leads to replica symmetry breaking |87j. 
However the corrections to the replica symmetric result are relatively small, and replica 
symmetry is a good approximation in this case; in contrast it is exact for Hebbian 
learning and PCA (see e.g. Fig 03, C). 

In summary, Fig. reveals different types of illusions in high dimensional data 
whose effects diminish rather slowly as 0{-^=) as the amount of data a increases. Indeed 
it should be noted that the very ability of the perceptron to store random patterns also 
depends on a certain illusion of structure: P random points in an N dimensional space 
will typically lie on one side of some hyperplane as long as a = P/N < 2. 

3.6. From Message Passing to Synaptic Learning 

We have seen in section [37T] that a perceptron with N synapses has the capacity to learn 
P random associations as long as a = P/N < a c = 2. But what learning algorithm can 
allow a perceptron to learn these associations, up to the critical capacity? In the case 
of analog valued synaptic weights that we have been discussing, a simple algorithm, 
known as the perceptron learning algorithm (T3J [H] can be proven to learn any set 
of associations that can be implemented (i.e. those associations a M } for which a 
solution weight vector to ( l5Tj) exists). The perceptron learning algorithm iteratively 
updates a set of randomly initialized weights as follows (for simplicity, we assume, 
without loss of generality, that 0^ = 1, for all patterns). 

• When presented pattern /i, compute the current input / = w-£ M . 

• Rule 1: If / > 0, do nothing. 

• Rule 2: If / < 0, update all weights: Wj — > Wj+£f . 

• Iterate to the next pattern, until all patterns are learned correctly. 

Such an algorithm will find realizable solutions to (15T1) in finite time for analog 
synaptic weights. However, what if synaptic weights cannot take arbitrary analog 
values? Indeed evidence suggests that biological synapses behave like noisy binary 
switches [HE, ES], and thus can reliably code only two levels of synaptic weights, rather 
than a continuum. The general problem of learning in networks with binary weights (or 
weights with a finite discrete set of values) is much more difficult than the analog case; 
it is in fact an NP-complete problem [T5| [16] ■ An exact enumeration and theoretical 
studies have revealed that when weights are binary (say Wj = ±1), the perceptron 
capacity is is reduced to a c = 0.83, i.e. the space of binary weight vector solutions 
to ( 15T|) is nonempty only when a = P/N < a c = 0.83 [901 EE]- Of course, below 
capacity, one can always find a solution through a brute-force search, but such a search 
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will require a time that is exponential in N. It is unlikely to expect to find a learning 
algorithm that provably finds solutions in a time that is polynomial in N, as this would 
imply P = NP. However, is it possible to find a learning algorithm that can typically 
(but not provably) find solutions in polynomial time at large a < 0.83, and moreover, 
can this algorithm be biologically plausible? 

The work of JT7J [18] provided the first such algorithm. Their approach was to 
consider message passing on the joint probability distribution over all synaptic weights 
consistent with the desired associations (again we assume = 1): 

^(w) = ^n#(w-e). (68) 

Here the factors are indexed by examples. The messages from examples to synapses 
and synapses to examples are all distributions on binary variables, and therefore can 
be parameterized by real numbers, as = M jU ^(+l) — M m _h(— 1), and M^^iwi) oc 
e hi ^ Wi . The message passing equations (I3"2"j) - (13"3"j) then yield a dynamical system on the 
variables u^i and h^^. This system drives the messages to approximate the marginal 
distribution of a synapse across all synaptic weight configurations which correctly learn 
all P associations. However, we would like to find a single synaptic weight configuration, 
not a distribution. To do this, in [18] the message passing equations are supplemented 
by a positive feedback term on the updates for This positive feedback amplifies 

the messages over time and forces the weights to polarize to a single configuration, 
so that the approximation to the marginals through ( 1341 becomes a delta function on 
±1 for all synapses i. Furthermore, in the large N limit, one can approximate the 
dynamical system on the 2PN variables u^i and via an approximate message 
passing dynamical system on the time dependent variable h\ = J2e<t Eu=i u u-h [171 US] . 
Thus one obtains a learning rule in which each synapse maintains a single analog hidden 
variable hi. 

This rule was further simplified by allowing hi to only take a finite number of 
discrete values, and the actual value of the synaptic weight was related to the hidden 
variable via W{ = sgn(/tj) [18]. After this simplification, the (amplified) message passing 
equations can be written in an online form in terms of the following algorithm (for 
convenience, hi is allowed to take only odd integer values to avoid the ambiguous state 
hi = 0): 

• For pattern //, compute the current input / = w-£ M , where Wi = sgn(/ij). 

• Rule 1: If / > 1, do nothing. 

• Rule 2: If / < 0, update all internal states : hi — >• hi + 2£f. 

• Rule 3: If J = 1, then update each internal state hi — > /ij + 2£f, but only if h^ > 1. 

• Iterate to the next pattern, until all patterns are learned correctly. 

The resulting rule is quite similar to the perceptron learning rule above, except for 
the modification of rule 1 and the addition of rule 3. Rule 3 concerns situations in which 
pattern \i is barely correct, i.e. a change in a single synaptic weight, or a single pattern 
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component would cause I to be below threshold (which is 0), resulting in an error. 
For barely learned patterns, rule 3 reinforces those internal variables that are already 
pointing in the right direction, i.e. contributing positively to the input current / on 
pattern /i, by making them larger in absolute value. Note that rule 3 cannot change any 
synaptic weight wf, it is thus a metaplastic rule [92], or a rule that changes an internal 
state of the synapse without changing its synaptic efficacy 

Remarkably, the addition of rule 3, while seeming to be an innocuous modification of 
the perceptron learning rule, turns out to have a large impact on the learning capabilities 
of the discrete perceptron. For example, for a neuron with iV = O(10 5 ) synapses, 
when a G [0.3. . .0.6], the message passing derived algorithm finds a solution with a 
few tens of presentations per pattern, whereas a similar clipped perceptron algorithm 
obtained by removing rule 3 is unable to find such a solution in O(10 4 ) presentations 
per pattern [18]. Given the remarkable performance of message passing, it is intriguing 
to speculate whether some signature of message passing may exist within synapses. The 
key prediction is that in neurons that learn via error signals, metaplastic changes should 
occur whenever an error signal is absent, but the neuron is close to threshold. 

4. Random Matrix Theory 

The eigenvalue distributions of large random matrices play a central role in a variety 
of fields [HI ED]- For example, within the context of neuroscience, these distributions 
determine the stability of linear neural networks, the transition to chaos in nonlinear 
networks [21] , and they are relevant to the statistical analysis of high dimensional data. 
Replica theory provides a powerful method to compute the eigenvalue spectrum of 
many different classical random matrix ensembles, including random symmetric [93] 
and asymmetric [91] matrices. More recently, it has been applied to matrices whose 
connectivity obeys Dale's law, which stipulates that all the outgoing synaptic weights 
of any neuron have the same sign [95]. Here we will introduce the replica formalism 
for symmetric matrices, focusing on the Wishart matrix ensemble [961 197] because of its 
applications to high dimensional statistics discussed in section 15.31 

4-1. Replica Formalism for Random Matrices 

Suppose W is an iV by iV random matrix whose elements are drawn from some 
probability distribution. For any specific realization of W, its eigenvalue distribution is 



where Z{ are the eigenvalues of W. Now for large N, and for many distributions on the 
matrix elements of W, this eigenvalue distribution is self-averaging; for any realization 
of W, it converges as N — > oo, to its average over W, which we denote by (( pw(A) )) w . 
We would like to theoretically compute this average, but it is difficult to average (|69|) 




(69) 
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directly, since the eigenvalues z% are complicated functions of the matrix elements of W 
(i.e. the roots of the characteristic polynomial det(z — W)). 

To perform this average, it is useful to physically think of the eigenvalues Zi as a 
collection of coulomb charges in the complex plane. In two dimensions, such charges 
repel each other with a force that decays inversely with distance. Then the resolvent, 

can be thought of as (the negative of) the electric force field on a test charge placed at a 
point z in the complex plane, due to the presence of all the other charges z\, . . . , zn- In 
mathematics, the transformation from pw( z ) t° Rw{ z ) i n (E2) is known as the Stieltjes 
transform. For the case of symmetric W, the charge density is confined to the real axis, 
and one can recover the charge density from its force field via the relation 

pw(z) — hm — Im Rw(z + ie). (71) 

e-s>0+ 7T 

See section [8751 for a derivation of this relation. Now the force on a test charge at a point 
z is the derivative of the electrostatic potential $w(z), and it turns out this potential, 
as opposed to either the charge density pw(z) or the electric force R^(z), will be easy 
to average over W. We can derive a simple expression for the potential via the following 
sequence: 

Rw(z) = -Jr Tr 



N W-z 
"!^log(W 

d 2 



dz N bg 



det (W 



d ®w(z), (72) 



where, 



dz 



$ w (z) = §log Z w (z) (73) 



and 

Z w (z) = I ,i ue -i« T (w-)u. (74) 



Here we have used a Gaussian integral representation of [det(z — W)] - ^ in f|74|) and 
neglected factors which do not survive differentiation by z in (|72l) . 

Now the electrostatic potential $w(^) is expressed in (1731) as the free energy of a 
partition function Z w (z) given by (1731) . We can use this representation to average the 
potential over W, via the replica method to appropriately take care of the logarithm: 

«*wM)) W = |:<(logZwM»w. 
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This yields a general procedure for computing the average eigenvalue spectrum (i.e. 
charge density) of random Hermitian matrices. We first average a replicated version of 
the partition function in ( 17^1) (see ( 1761) below). This allows us to recover the average 
electrostatic potential through ( 175]) . which then leads to the the average electric field 
through ( 1721) . which in turn leads to the average charge density through ( J7T1) . We 
note that although we have focused on the case of hermitian matrices, this analogy 
between eigenvalues and coulomb charges extends to non-hermitian matrices in which 
the eigenvalue density is not confined to the real axis. 



4-2. The Wishart Ensemble and the Marcenko-Pastur Distribution 

As seen in the previous section, the first step in computing the eigenvalue spectrum of 
a Hermitian random matrix involves computing the average 

^))) w = (( jn: = i^e-^ a ^(w-K^. (76) 

At this point we must choose a probability distribution over W. When the matrix 
elements of W are chosen i.i.d. from a gaussian distribution, one obtains Wigner's 
semicircular law [HE] for the eigenvalue distribution, which was derived via the replica 
method in [99J. 

Here we will focus on the Wishart random matrix ensemble in which 

W = 1 A T A, (77) 

where A is a P by N matrix whose elements are chosen i.i.d. from a zero mean, unit 
variance Gaussian. This matrix has a simple interpretation in terms of high dimensional 
data analysis. We can think of each row of the matrix A as a data vector in an N 
dimensional feature space. Each data vector, or row of A is then a single draw from 
a multivariate Gaussian distribution in N dimensions, whose covariance matrix is the 
identity matrix. W is then the empirical covariance matrix of the P samples in the 
data set A. In the low dimensional limit where the amount of data P — > oo and iV 
remains 0(1), the empirical covariance matrix W will converge to the identity, and its 
spectrum will be a delta- function at 1. However, in the high dimensional limit in which 
P, N — > oo and a = P/N = 0(1), then even though on average W will be the identity 
matrix, fluctuations in its elements are strong enough that its eigenvalue spectrum for 
typical realizations of the data A will not converge to that of the identity matrix. Even 
when a > 1, the case of interest here, there will be some spread in the density around 
1, and this spread can be thought of as another illusion of structure in high dimensional 
data, which we now compute via the replica method. 
Inserting (!77j) into (176"]) we obtain 



^))w=/n/ ua 



e 2 

A 



.(78) 

W J "=! L \ \ / / A J 

Now the integrand depends on the quenched disorder A only through the variables 
AjJ = -Jjf a fi ' u a ; where is row /i of the matrix A. These variables are jointly gaussian 
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— u° 

N 



distributed with zero mean and covariance U A^A|J, )/ = Qab^^v where Q 
Thus the average over A can be done by a gaussian integral over the variables A^ 



g 2 a ^2a zZ^(^«) 



p 



det (/ + -gp 

-JVf Trlog(/+iQ) 



• IT. 

(79) 

(80) 

(81) 
(82) 



Here in going from ( 1791) to ( IHUl) . we have exploited the fact that the variables A^ are 
uncorrelated for different //, yielding a single average over variables A a with covariance 
((A a Ab}) = Qab, raised to the power P. In going from (180]) to ( 182]) we performed the 
gaussian integral over A a . 

Thus consistent with the general framework in section I8.lt averaging over the 
disorder introduces interactions between the replicated degrees of freedom u a which 
depend only on the overlap matrix Q ab . Therefore we can compute the remaining 
integral over u a in ( 178]) by integrating over all overlaps Q ab , and integrating over all 
configurations of u a with a given overlap Q. This latter integral yields an entropic 
factor that depends on the overlap. In the end ( 178]) becomes 

-N(E(Q)-S(Q)) 



w 



ab 



where 



and 



E(Q) 



^Trlog(/+-g)-^Trg, 
2 a I 



1 



S(Q) = -TrlogQ, 



(83) 



34) 



J5) 



is the usual entropic factor. The first term in (!8~4"]) comes from (!82l) while the second 
term in (1MI) comes from the part outside the average over A in (178"]) . 

Now the final integral over Q ab can be done via the saddle point method, and the 
integral can be approximated by the value of the integrand at the saddle point matrix 
Q which extremizes F{Q) = E{Q) — S{Q). We can make a decoupled replica symmetric 
ansatz for this saddle point, Q ab = qS ab . With this choice, (175]) leads to the electrostatic 
potential 

(($ w (*) »w = -«log(l + -q) + izq + logg (86) 

a 

and ( 172]) leads to the electric field 

«iM*)» w = W- ( 8T ) 

Here q satisfies the saddle point equation obtained by extremizing F(q), or equivalently 

the right hand side of ( 186]) . 

a 1 , 
+ z + - = 0. 88 

a + iq iq 
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This is a z dependent quadratic equation for iq, and due to the relation between the 
electric field and charge density in (I7ip . we are interested in those real values of z for 
which the solution iq has a nonzero imaginary part. It is in these regions of z that charges 
(eigenvalues) will accumulate, and their density will be proportional to this imaginary 
part. In the regime in which a > 1 (so we have more data points than dimensions), 
a little algebra shows that iq has an imaginary part only when z_ < z < z + where 
z± = (1 ± ^j) 2 - In this region the charge density is 



aJ(z - z)(z + - z) 
(( Pw(z) » w = -, (89) 

which is the Mar cenko- Past ur (MP) distribution (see Fig. [BJA. below). Thus due to the 
high dimensionality of the data, the eigenvalues of the sample covariance matrix spread 
out around 1 over a range of O(i-). This illusory spread becomes smaller as we obtain 
more data (increased a). 



4-3. Coulomb Gas Formalism 

In the previous section we found the marginal density of eigenvalues for a Wishart 
random matrix, but what about the entire joint distribution of all N eigenvalues? 
This distribution has a physically appealing interpretation that provides intuition for 
applications in high dimensional statistics discussed below. Consider the distribution of 
W = A T A, i.e. the matrix in ( I77|) without the p scaling. Because the P by N matrix 
A has i.i.d zero mean unit variance Gaussian elements, the distribution of A is given by 

P(A) oce"^ TrATA , (90) 

Now each matrix A has a unique singular value decomposition (SVD), A = USV T , 
where U and V are unitary matrices and S is a P by N matrix whose only iV nonzero 
elements are on the diagonal: Ejj = <t,. The cr^'s are the singular values of A, and the 
eigenvalues Aj of W are simply the square of these singular values. Thus to obtain the 
joint distribution for Aj, we first perform the change of variables A = USV T in the 
measure f )90|) . 

Fortunately, -P(A) is independent of U and V, and depends only on S. However, 
we need to transform the full measure -P(A) dA, and therefore we must include the 
Jacobian of the change of variables, given by (see e.g. [100]) 

N 

^ = n (°? - o?) n o-r N (u T du)(dx)(v T dv). ( 9 i) 

i<j i=l 

Now the angular variables U and V decouple from the singular values <jj, so we can 
integrate them out, yielding a constant. Furthermore, we can perform the change of 
variables Aj = af to obtain 

N 

P(A 1; . . . , \ N ) oc c -JX£i* J] \f P - N - 1] J] |A, - A fc |. (92) 

i=l j<k 
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Here the first factor in the product arises from -P(A) in (1901) while the second two factors 
arise from the Jacobian incurred by the change of measure in (19T1) . This joint distribution 
can be written as a Gibbs distribution at unit temperature, P({A;}) oc e~ E ^ Xi ^\ where 
the energy is 



This energy function has a simple interpretation in which each eigenvalue is a Coulomb 
charge on confined to the real axis on the 2D complex plane. Each charge moves 
in a linear plus logarithmic potential which confines the charges, and there is a 
pairwise repulsion between all charges governed by a logarithmic potential (the Coulomb 
interaction in two dimensions) . The Coulomb repulsion balances the confinement due to 
the external potential when the charges, or eigenvalues spread out over a typical range 
of O(N). More precisely, this range is given by (1 ± ^Ja) 2 N where a = P/N (note this 
is consistent with z± defined above ( 1891) after rescaling by -p), and within this range, 
the charge density in the N — » oo limit is given by the MP distribution. 

4-4- Tracy-Widom Fluctuations 

In the previous sections we have seen that full joint distribution of eigenvalues behaves 
like a Coulomb gas and its typical density at large N is given by the MP distribution. 
However, what do typical as well as large fluctuations of the maximal eigenvalue, or 
right most charge behave like? The distribution of the maximal eigenvalue forms a null 
distribution to test for the statistical significance of outcomes in PCA, and also plays 
role in random dimensionality reduction, so its fluctuations are of great interest. The 
mean of the maximal eigenvalue Xmax of course lies at the end of the MP charge density 
and is given, to leading order in N, by (Xmax) — (1 + y/a) 2 N. Typical fluctuations 
about this mean have been found to scale as O(N^) [23]. More precisely, for large N 
they have the limiting form 



where x is the Tracy-Widom distribution that has a range of 0(1) [24J. 

The computation of these typical fluctuations is involved, but often we are interested 
in the probability of large deviations in which \\max — (Xmax)\ = O(N). These large 
deviations were computed in (26J, [27] in a very simple way using the Coulomb gas 
picture. Suppose for example the largest eigenvalue occurs at distance that is O(N) 
to the right of the typical edge of the MP density, (1 + y/a) 2 N. The most likely way 
this could happen (i.e. the saddle point configuration of charges in ( 192]) [27]). is that 
a single eigenvalue pops out of the MP density, while the remaining eigenvalues are 
unperturbed, and preserve the shape of the MP density. The energy paid by a single 
eigenvalue popping out of the MP density is dominated by the linear confining term in 
f )93]) . and is therefore proportional to the distance it pops out. Since the probability of 



1 1V 

i?=T(A,-(F-]V-l)ln Aj) - £ In | A, - X k 



(93) 




Xmax = (1 + \fa) 2 N + a"s(l + y/a)*N*x, 



(94) 
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Figure 6. Spectral distributions of empirical noise covariance matrices. (A) P — 2000 
random points in N — 1000 dimensional space (a — 2) are drawn from a mean 
identity covariance normal distribution. The blue histogram is the distribution of 
eigenvalues of the empirical covariance matrix W = -p A T A, where A is a P by N 
data matrix whose rows correspond to the points (see Eq. ([77)) ). The red curve is 
the Marcenko-Pastur distribution (see Eq. ([55)) ) for a = 2. (B) A histogram, in blue, 
of the maximal eigenvalues of 1000 random covariances matrices W, each constructed 
exactly as in (A). The red curve is the Tracy- Widom distribution in (IM| for a = 2, 
rescaled by -p. The dashed red line marks the edge of the Marcenko-Pastur distribution 
in (A). The discrepancy between this edge and the mean of the maximal eigenvalue 
distribution is a finite size effect; this discrepancy, like the fluctuations in the maximal 
eigenvalue, vanishes as 0(A~ 2 / 3 ). 



a fluctuation is exponentially suppressed by its energy cost (here entropy plays no role 
because the MP density is unperturbed), we obtain 

Proh(\ MAX = (\ MAX ) + cN)<xe- N * +(c) for cN ^> (X MAX ). (95) 

Thus large deviations of O(N) are exponentially suppressed by N, and the 0(1) large 
deviation constant $+(c) can be computed explicitly by quantitatively working out this 
Coulomb g gument [261 E7J. 

On the other hand, suppose that the maximal eigenvalue A max occurs at a distance 
cN to the left of right edge of the MP density. In order for this fluctuation to occur, 
the entire MP density must become compressed, incurring a much larger energy cost 
compared to a positive or rightward fluctuation of A m ax- Indeed because of the Coulomb 
repulsion between all pairs of charges in ( 1931) . the energy cost of compression is 0(N 2 ) 
leading to the stronger suppression, 

Proh(\ MAX = (\ MAX }-cN)(xe- N2 ' s, -( c) for cN > (Xmax), (96) 

where $_(c) is an 0(1) large deviation function computed in [261 [27]. Thus the physics 
of Coulomb gases gives a nice explanation for the asymmetry in the large deviations of 
the Tracy- Widom distribution. 
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For the reader's convenience, we summarize the implications of the Coulomb gas 
formalism for high dimensional statistics by reintroducing the 4 scaling in the definition 
( 1771) of the empirical covariance matrix. The above results then tell us that the 
maximal eigenvalue of the empirical covariance matrix of P random Gaussian points 
in N dimensional space, in the limit N, P — > oo with a = P/N remaining 0(1) (but 
strictly greater than 1), has a mean (Xmax) — (1 + ^=) 2 with typical fluctuations about 
this mean that are 0{N~z) (see Fig. [6j3) . Moreover, the probability of large 0(1) 
positive deviations of Xmax are 0(e~ N ), while the probability of large 0(1) negative 
deviations of Xmax are 0(e~ N2 ). 



5. Random Dimensionality Reduction 

We have seen in section 13.51 that we may need to be careful when we perform 
dimensionality reduction of high dimensional data by looking for optimal directions 
along which to project the data, as this process can potentially lead to illusions of 
structure. An alternate approach might be to skip the optimization step responsible for 
the illusion, and simply project our data onto randomly chosen directions. However, 
it is not at all obvious that such random dimensionality reduction would preserve the 
true or interesting structure that is present in the data. Remarkably, a collection of 
theoretical results reveal that random projections (RP's) preserve much more structure 
than one might expect. 



5.1. Point Clouds 

A very generic situation is that data often lies along a low dimensional manifold 
embedded in a high dimensional space. An extremely simple manifold is a point cloud 
consisting of a finite set of points, as in Fig. [7A. Suppose this cloud consists of P points 
s a , for a = 1, . . . , P, embedded in an N dimensional space, and we project them down to 
the points x™ = As" in a low M dimensional space through an appropriately normalized 
random M x N random projection matrix A. The squared euclidean distances between 
pairs of points in the high dimensional space are given by ||s Q — s^|| 2 and in the low 
dimensional space by ||x a — x^ 2 . The fractional distortion in the squared distance 
incurred by the projection is given by 

|| x a_ x /3||2_ || S "_ S 0||2 

D <# = - jjiSbjfi — L - ^ 

How small can we make M before the point cloud becomes distorted in the low 
dimensional space, so that the low and high dimensional distances are no longer similar? 

The celebrated Johnson-Lindenstrauss (JL) lemma [101] (see |102l 1103 ] for more 
recent and simpler proofs) and provides a striking answer. It states that for any 
distortion level < 8 < 1, as long as M > 0(^f-), with high probability, one can 
find a projection such that 

-5 <D a(3 <5 (98) 




Figure 7. Random Projections. (A,B) Projection of a point cloud, and a nonlinear 
manifold respectively. (C) A manifold of if-sparse signals (red) in N dimensional space 
is randomly projected down to an M dimensional space (here K = 1, N = 3, M = 2). 

for all pairs of points a and /3. Thus the distortion between any pair of points rarely 
exceeds 5. This is striking because the number of projected dimensions M need only be 
logarithmic in the number of points P independent of the embedding dimension of the 
source data, N. Of course, with so few projections, one cannot reconstruct the original 
data from its projections. Nevertheless, surprisingly, with so few random projections the 
geometry of the entire point cloud is preserved. We will discuss a statistical mechanics 
based approach for understanding the JL lemma in section 15.31 

5.2. Manifold Reduction 

Consider data distributed along a nonlinear K dimensional manifold embedded in N 
dimensional space, as in Fig 03. An example might be a set of images of a single object 
observed under different lighting conditions, perspectives, rotations and scales. Another 
example would be the set of neural firing rate vectors in a brain region in response to 
a continuous family of stimuli. |104[ |105[ 1106] show that M > 0(J| logiVC) random 
projections preserve the geometry of the manifold up to distortion 5. Here C is a number 
related to the curvature of the manifold, so that highly curved manifolds require more 
projections. Overall, the interesting result is that the required number of projections 
depends linearly on the intrinsic dimensionality of the manifold, and only logarithmically 
on its ambient embedding dimension. 

The simplest finite dimensional manifold is a K dimensional linear subspace in an 
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iV dimensional space. It can be shown |1U7] that M > O(j^) RP's are sufficient to 
preserve all pairwise distances between data points within a distortion level 5. We will 
give an alternate proof of this result in section 15.31 below using the results of sections 
14.21 and 14.31 Of course, for such a simple manifold, there exists a nonrandom, optimal 
geometry preserving projection, namely the PCA basis consisting of M = K orthogonal 
vectors spanning the manifold. Thus we pay a price in the number of projections for 
choosing random projections rather than the optimal ones. Of course, for data that is 
not distributed along a hyperplane, a PCA based projection will no longer be optimal, 
and will not generically preserve geometry. 

Sparsity is another example of an interesting low dimensional structure. Consider 
for example, the (nonsmooth) manifold of iV dimensional signals with only K nonzero 
components. This is a manifold of (fy coordinate hyperplanes in N dimensional 
space, as in Fig. [7D. The geometry of this manifold can also be preserved by 
random projections. In particular |107j shows that random projections down to an 
M > 0(j& log |?) dimensional space preserve the distance between any pair of .ff -sparse 
signals, with distortion less than S. 

Beyond the issue of preserving geometry under an RP, one might be interested in 
situations in which one can invert the random projection; i.e. given the projection of a 
data vector, or signal in the low M dimensional space, how might one recover the original 
signal in the high N dimensional space? For the case of point clouds (Fig. [7A) and 
general nonlinear manifolds (Fig. 03), there are no general computationally tractable 
algorithms capable of achieving this high dimensional signal recovery. However, for the 
case of if-sparse signals (Fig. WP), there exists a simple, computationally tractable 
algorithm, known as L\ minimization, reviewed below, that can provably, under 
various assumptions on the RP matrix A, recover the high dimensional signal from 
its projection. It turns out that geometry preservation is a sufficient condition for 
signal recovery; in particular, |108] shows that any projection which preserves the 
geometry of all A"-sparse vectors, allows one to reconstruct these vectors from the low 
dimensional projection, efficiently and robustly using L\ minimization. This is one of 
the observations underlying the field of compressed sensing, reviewed below. 

However, even in situations where one cannot accurately reconstruct the high 
dimensional signal from its projection, RP's can still be very useful by allowing for 
compressed computation directly in the low dimensional projection space, without the 
need for signal recovery. This can be done because many interesting machine learning 
and signal processing algorithms depend only on pairwise distances between data points. 
For example, regression |109j . signal detection |110j . classification |11R 11121 II 13|. ITT4] . 
manifold learning [115] . and nearest neighbor finding [102J can all be accomplished 
in a low dimensional space given a relatively small number of RP's. Moreover, task 
performance is often comparable to what can be obtained by performing the task directly 
in the original high dimensional space. The reason for this remarkable performance is 
that these computations rely only on the distances between data points, which are 
preserved by RP's. 
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5.3. Correlated Extreme Value Theory and Dimensionality Reduction 

The proofs |101l 1102} I103|. I10T|. 1104] behind the remarkable sufficient conditions for low 
distortion of submanifolds under RP's rely on sequences of potentially loose inequalities. 
Thus they leave open the question of whether these sufficient conditions are actually 
necessary, and how the typical, as opposed to worst case, distortion behaves. Is it 
possible to use a direct approach, more in the spirit of statistical mechanics, to simply 
compute the probability distribution of the typical distortion of random manifolds under 
random projections? Here the random choice of manifold plays the role of quenched 
disorder, the random choice of projection plays the role of thermal degrees of freedom, 
and the observable of interest is the distribution, across choices of RP's, of the maximal 
distortion over all pairs of points in the fixed manifold. The hope is that this distribution 
is self-averaging, in that it does not depend on the choice of a particular manifold from 
a suitable ensemble of manifolds. In general, this approach is challenging, but here we 
show that it can be carried out for two very simple classes of manifolds: random point 
clouds and hyperplanes. Our main goal in this section is simply to obtain intuition for 
the scaling behavior of some of the inequalities discussed in the previous section. 

Consider a fixed realization of a Gaussian point cloud consisting of P points s a , 
a = 1, . . . , P in N dimensional space. Let A be an M by N random projection 
operator whose matrix elements are i.i.d Gaussian with zero mean and variance -h, 
and let x a = As" be the low dimensional image of the cloud. With this choice of 
scaling for the variance of the projection operator, it is straightforward to show that 
any one distortion D a p in ( 1971) is in the large N limit, for a fixed (i.e. quenched) 
point cloud, a Gaussian random variable with zero mean and variance O(-k), due to 



distortions, and we are interested in the maximum distortion, whose behavior could in 
principle depend on the correlations between pairs of distortions. For random Gaussian 
point clouds, the correlation coefficient between two pairs of distortions are weak, in 
fact O(jj), and can be neglected. In this manner, the ambient dimensionality of the 
point cloud disappears from the problem. Thus the maximum distortion over all pairs 
of points can be well approximated by the maximum of 0(P 2 ) independent Gaussian 
variables each with variance O(j^). In general the maximum of R independent Gaussian 
variables with variance a 2 approaches a Gumbel distribution in the large R limit, 
and takes typical values of 0(a\/\n R). Indeed the Gumbel distribution is a universal 
distribution governing extreme values of any random variables whose tails vanish faster 
than exponentially [116] : this strong suppression of extreme values in any single variable 
directly leads to an extremely slow A/In R growth of the maximum over R realizations 
of such variables. Applying this general result with a 2 = and R = 0(P 2 ) yields 

the conclusion that the maximal distortion over all pairs of points obeys a Gumbel 
distribution, and its typical values scale with P and M as O(y^p). Thus the origin of 
the extremely slow s/YaP growth of the maximal distortion with the number of points 
P is due to the strong, Gaussian suppression of any individual distortion. This effect 



the random choice of A. Now there 




possible 
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is directly responsible for the remarkable JL lemma. For example, if we desire our 
maximal distortion to be less than 5, we must have 




or equivalently M > {^f-j, which, up to constants, is the JL result. Thus extreme 
value theory for uncorrelated Gaussian variables provides a natural intuition for why 
the number of random projections M need only be logarithmic in the number of points 
P, and independent of the ambient dimension N, in order to achieve an 0(1) distortion 
5. 

For random Gaussian point clouds, we were able to neglect correlations in the 
distortion between different pairs of points. For more general manifold ensembles, we 
will no longer be able to do this. However, for the ensemble of random hyperplanes, an 
exact analysis is still possible despite the presence of these correlations. Let U be an 
N by K random matrix whose K orthonormal columns form a basis for a random K 
dimensional subspace of N dimensional space (drawn uniformly from the space of such 
subspaces). What is the distribution of the maximal distortion in (1971) where s a and 
range over all pairs of points in this subspace? First, by exploiting rotational invariance 
of the ensemble of A and U, we can always perform a change of basis in N dimensional 
space so that the columns of U are mapped to the first K coordinate axes. Thus 
points in the hyperplane can be parameterized by N dimensional vectors whose only 
nonzero components are the first K coordinates, and the statistics of their projection 
to M dimensional space can be determined simply by the M by K submatrix of A 
consisting of its first K columns. In this manner, the dimensionality iV of the ambient 
subspace again disappears from the problem. Second, by exploiting the linearity of the 
projection operator, to compute the maximal distortion over all pairs of points in the 
plane, it suffices to compute the maximal distortion over all points on the unit sphere 
in K dimensional space. Thus if we let A denote the M by K submatrix of A, and let 
s denote a i^-dimensional coordinate vector for the hyperplane, then we have 

max aj g D a p = max/ s> || s |i2 = H \/s T A T As — 1. (100) 

Here, to obtain a slightly cleaner final result, we are now measuring the distortion D a p 
in terms of fractional change in Euclidean distance as opposed to the squared Euclidean 
distance used in ( 1971) . hence the square root in ( 11001) . The constrained maximum over 
s of s T A T As in (11001) is simply the maximum eigenvalue of the matrix A T A, and its 
distribution over the random choice of A has been characterized in sections 14.21 and 
14.31 In fact the results in these sections carry over with the replacements P — > M and 
iV K. The maximal eigenvalue is with high probability equal to (1 + -^j) 2 , with 

a = jr. Its typical fluctuations are 0(M~§), while its large positive deviations of 0(1) 
are exponentially suppressed in M, i.e. are 0(e~ M ). So the maximal distortion in (11001) 
is close to -7=. As long as M > K, a similar argument holds for the minimum distortion, 

which will be close to — \=. Indeed, if M < K, then A T A will have zero eigenvalues, 
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which correspond geometrically to vectors in the random hyperplane U that lie in the 
kernel of the random projection A. So as long as a > 1, the distribution of distance 
distortions D a p will with high probability lie in the range — to This means 

of course, that if one wants all distortions D a/3 to obey —5 < D a/3 < +5, then one can 
achieve this with high probability as long as 5 < or equivalently, the number of 
random projections obeys M > J|, which proves the claim about RP's of hyperplanes 
made in section 15.21 Overall, this argument shows how the extremal fluctuations of 
correlated random variables (i.e. the charges of a Coulomb gas described in section 143)) 
can be used to understand geometric distortions induced by RP's of simple manifolds, 
namely hyperplanes. 

6. Compressed Sensing 

We have seen in section |5] that random projections can preserve the geometric structure 
of low dimensional signal manifolds. Furthermore, in the case in which the manifold is 
the space of if-sparse signals (Fig. EC), as discussed above, one can actually recover the 
high dimensional signal from its projection using a computationally tractable algorithm, 
known as L\ minimization. Here we review this algorithm and its analysis based on 
statistical mechanics and message passing. As discussed in the introduction, many 
applications of the ideas in this section and the previous one are described in [30] . 

6.1. Li Minimization 

Suppose s° is an unknown sparse N dimensional vector which has only a fraction 
/ = K/N of its elements nonzero. Thus s° is a point in the top manifold of Fig. 
EC. Suppose we are given a vector x of M < N measurements, which is linearly related 
to s° by an M by N measurement matrix A, i.e. x = As , x is then the projection 
of s° in the bottom manifold of Fig. [7C. Each measurement x^, for // = 1, . . . , M is a 
linear function a M • s° of the unknown signal s°, where a M is the /i'th row of A. In the 
context of signal processing, s° could be a temporal signal, and the a M could be a set 
of N temporal filters. In the context of network reconstruction, s° could be a vector of 
presynaptic weights governing the linear response of a single postsynaptic neuron x u to 
a pattern of presynaptic stimulation a M on a trial // [35] . 

Now how might one recover s° from x? In general, this is an underdetermined 
problem; there is an N — M dimensional space of candidate signals s that satisfy the 
measurement constraint x = As. The true signal s° is but one point in this large 
space. However, we can try to exploit our prior knowledge that s° is sparse by searching 
for sparse solutions to the measurement constraints. For example, one could solve the 
optimization problem 



N 




subject to x = As, 



(101) 



i=l 



CONTENTS 



48 



to obtain an estimate s of s°. Here V(x) is any sparsity promoting function. A natural 
choice is V(x) — 1 if x — and V(x) = otherwise, so that (jlOip yields the the signal 
consistent with the measurements x with the minimum number of nonzero elements. 
However, this is in general a hard combinatorial optimization problem. One could relax 
this optimization problem by choosing V(s) = \s\ p , so that fllOip finds a solution to the 
measurement constraints with minimal L p norm. However, this optimization problem 
is nonconvex for p < 1. Thus a natural choice is p = 1, the lowest value of p for 
which the recovery algorithm f llOll) becomes a convex optimization problem, known as 
Li minimization. 



6.2. Replica Analysis 

Much of the seminal theoretical work in CS |117l 11181 1108] has focused on sufficient 
conditions on A to guarantee perfect signal recovery, so that s = s° in fllOip . in the 
case of L\ minimization. But often, large random measurement matrices A which 
violate these sufficient conditions nevertheless typically yield good signal reconstruction 
performance. Thus these sufficient conditions are not necessary. Here we review a 
statistical mechanics approach to CS based on the replica method |119l 11201 1121] , which 
allows one to directly compute the typical performance of Li minimization. Some of 
these results have also been derived using message passing [55] and polyhedral geometry 

rj22]. 

To understand the properties of the solution s to the optimization problem in (11011) . 
we define an energy function on the residual u = s — s° given by 

A N 

E(u) = -u r A r Au + + <|, (102) 

and analyze the statistical mechanics of the Gibbs distibution 

Pg(u) = ie-^W (103) 

By taking the limit A — > oo we enforce the constraint x = As. Then taking the low 
temperature — > oo limit condenses the Gibbs distribution onto the vicinity of the 
global minimum of fllOip . Then we can compute the average error 

1 N 

3o=^E< u *)p g , (104) 
iV i=i 

and, if needed, the thermal fluctuations 

1 N 



A ^ = ^E((^) 2 >p G (105) 

iV i=l 

Now, Pg, and therefore its free energy — (3F = logZ, average error Q , and 
fluctuations AQ all depend on the measurement matrix A and on the signal s°. We take 
these to be random variables; the matrix elements are drawn independently from 
a standard normal distribution, while s° has fN randomly chosen nonzero elements 
each drawn independently from a distribution P(s°). Thus A and s° play the role 
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of quenched disorder in the thermal distribution Pq. In the limit M, N — > oo with 
a = M/N held fixed, we expect interesting observables, including the free energy, Q 
and AQ to be self-averaging; i.e. the thermal average of these observables over Pq for 
any typical realization of A and s° coincides with their thermal averages over Pq, further 
averaged over A and s°. Thus the typical error Q does not depend on the detailed 
realization of A and s°. We can therefore compute Qq by computing the average free 
energy — f}F = ((log Z))as° using the replica method. 

Details of the replica calculation can be found in [121j . Basically, averaging over 
A in the replicated Gibbs distribution corresponding to the energy (11021) reduces to 
averaging over the variables = -^ a ^ 4 u", where u a , a = 1, . . . ,n are the replicated 
residuals. These variables are are jointly gaussian distributed with zero mean and 
covariance u 5b® 5b b u jj = Qab^^u-, where Q a b = ^ J2f=i ^1^. The replica method yields 
a set of saddle point equations for the overlap matrix Q. Given the convexity of the 
energy function (I102p , it is reasonable to choose a replica symmetric ansatz for the saddle 
point, Q a b = AQ5 a b + Qo- Under this replica symmetric ansatz, further averaging over 
s°, and taking the A — > oo limit, yields a set of self-consistent equations 

Qo =({{«)U)) M . ( 106 ) 

AQ = (((5u 2 } HMF }} z ^. (107) 
Here the thermal average {-) h mf is performed with respect to a Gibbs distribution 

P^( s | s °) = I e -" MF , (108) 

Zj 

with an effective mean-field Hamiltonian 



where we make the substitution s — s° = u. Furthermore the quenched average ((■)) zs o 
denotes an average over a standard normal variable z and the full distribution of the 
signal component s° (given by f'S(s°) + (1 — f)P(s )). 

The relationship between the mean field theory P MF in ffT08]) and the original Gibbs 
distribution Pq in (11031) is as follows. The replica parameters Qo and AQ in (11061) and 
(11071) are identified with the order parameters (11041) and (I105p . Thus solving H106[) and 
(11071) in the zero temperature j5 — > oo limit allows us to compute the typical error Qo 
of CS as a function of a and /. Furthermore, consider the marginal distribution of a 
single signal component in Pq{u) = Pq(s — s°), given the true signal component is 
s°. According to replica theory, the mean field theory prediction for the distribution of 
this marginal is given by 

P G (s k = s\s° k = s°) = ((P MF (s I s°) )) z , (110) 

where P MF (s \ s°) is defined by (11081) and (11091) . and Q and AQ are the solutions to 
(DUD and ffTOTD . 

Now in solving fl 1 6 [) and f ll 07[) in the (3 — > oo limit, one finds two distinct classes 
of solutions |121j depending on the values of a and /. For a > a c (f) one finds solutions 
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in which both AQ and Qq vanish as O(-p). It is expected that thermal fluctuations 
captured by AQ should always vanish in the low temperature limit, but the fact that 
Qo, which captures the typical error of CS, also vanishes, suggests that for a > cx c (f), 
Li minimization should exactly recover the true signal, so that s = s° in (11011) . On the 
otherhand, for a < a c (f), this class of solutions no longer exists, and instead a new 
class of solutions occurs in which AQ is O(^) but Q remains 0(1) as (3 — > oo. This 
class of solutions predicts an error regime in which s ^ s° due to too few measurements. 
Thus replica theory predicts phase a transition between a perfect and an imperfect 
reconstruction regime in the a-f plane, as verified in Fig. |HJA.. This phase boundary was 
first derived in |123[ 1122] using very different methods of convex geometry. 

The phase boundary simplifies in the / — > limit of high sparsity. In this 
limit, ot c {f) = /logl//. This result can be understood from an information theoretic 
perspective. First, the entropy of a sparse signal of dimension N is 0(Nf log l/f). 
Second, assuming that each of our measurements carries 0(1) bits of entropy, and that 
they are not redundant or highly correlated, then the entropy of our measurements is 
O(M). It will not be possible to perfectly reconstruct the signal using any reconstruction 
algorithm whatsoever, if the entropy of our measurements is less than the entropy of 
our signal. The requirement that the measurement entropy exceed the signal entropy 
then yields the inequality a = M/N > 0(/logl//). Thus from the perspective 
of information theory, it is not surprising that we can reconstruct the signal when 
a > a c {f)- What is surprising is that a very simple, polynomial time algorithm, 
L\ minimization, is capable of performing the reconstruction, down to a number of 
measurements that approaches the information theoretic limit at small /, up to constant 
factors. 

What is the nature of this phase transition? For example if we decrease a from 
above a c (f) to below, do we see a catastrophic rise in the error, or does performance 
gracefully degrade? In the language of statistical physics, does Qo(a, f) undergo a first 
or second order phase transition in a? Fortunately, it is a second order phase transition, 
so that Qo rises continuously from 0. The exponent governing the rise depends on the 
distribution of non- zeros P(s°); namely the more confined this distribution is to the 
origin, the shallower the rise (see Fig. EB)- Note that the phase boundary a c (f) in 
contrast is universal, in that it does not depend on the distribution of non-zeros in the 
signal. 

Finally, we can understand the nature of the errors made by CS by looking at 
the distribution the signal reconstruction components conditioned on the true signal 
component. This is of course interesting only in the error regime. To take the zero 
temperature limit we can make the change of variables AQ = ^Aq, and Qo = aqo 
where Aq and qo remain 0(1) as (3 — > oo. Then the mean field Hamiltonian in ( II 9 [) 
becomes 



Since the entire Hamiltonian is proportional to (3, in the large (3 limit, the statistics of 




(111) 
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Figure 8. Compressed sensing analysis. (A) The red curve is the theoretical phase 
boundary a c (f) obtained by solving (|106|) and (|107|) . We also use linear programming 
to solve (|101[) 50 times for each value of a and / in increments of 0.01, with N = 500. 
The black transition region shows when the fraction of times perfect recovery occurs 
is neither nor 1. For all other a > a c (f), we obtained perfect recovery all 50 times, 
and for all other a < a c (f) we never once obtained perfect recovery. The width of this 
transition region narrows as N is increased (not shown), yielding a sharp transition 
in the N — > oo limit at the phase boundary a c (f). (B) Blue points are the average 
Li reconstruction error obtained by solving (|101j) 100 times for each for 4 values of 
/ = 0.2, 0.4, 0.6, 0.8, and various a, with N = 500. Error bars reflect the standard error. 
Red curves are plots of Qo obtained by solving (|106|) and (|107[) in the error phase. (C) 
The soft thresholding function defined in (1113j) . (D) The blue histograms are the 
conditional distribution of nonzero signal reconstruction components obtained from 
solving (jlOll) 2000 times, while the height of the green bar represents the average 
fraction of components that are zero, all conditioned on the value of the true signal 
component s° k . Here N = 500, a = f = 0.2, and P{s°) = ±<S(s° - 1) + ±6(s° + 1). For 
these values of a and /, the order parameters were numerically found to take the values 
qo = 1.06 and Aq = 1.43. The red curves are the theoretically predicted distribution 
of nonzero reconstruction components in (|114|) , while the red dot is the theoretically 
predicted height of the delta function at s = predicted in (|114|) , all conditioned on the 
three possible values of the true signal s°, — 1 (top), (middle) and +1, bottom. Each 
distribution can be thought of as arising from a Gaussian distribution with mean s 
and variance go, fed through the soft threshold function in (C), with a noise threshold 
a = Aq. 
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s are dominated by the global minimum of fillip . In particular, we have 
(s) h mf = v( s° + z y/qt , Aq) , 



(112) 



where 



r)(x,a) 



argmin. - 




+ \s 



) 



= sgn(x)0| -<r)+, 



(113) 



is a soft thresholding function (see Fig. [Hp) which also arises in message passing 
approaches [55J to solving the CS problem in fllOlj) . and (y) + — y if y > and is 
otherwise 0. 

The optimization in ( JTTH]) can be understood intuitively as follows: suppose one 
measures a scalar value x which is a true signal s° corrupted by additive gaussian noise 
with variance a. Under a Laplace prior e~' s °' on the true signal, r](x, a) is simply the 
MAP estimate of s° given the data x, which basically chooses the estimate s = unless 
the data exceeds the noise level a. Thus we see that in (11111) and ( 11121) . s° + z^fq§ plays 
the role of the observed, corrupted data x and Aq plays the role of an effective noise 
level a. 

This optimization has an interpretation within the cavity method jl!9j . It is the 
optimization that a new signal component s added to a cavity system in the absence of 
that component must perform to minimize its total energy in f l 1 03D . This minimization 
reflects a compromise between minimizing its own absolute value, and satisfying all 
the measurement constraints, whose sum total effect is encapsulated by the quadratic 
term in the mean field theory of f illip . The cavity field encapsulating the effect of 
all other measurements will vary from component to component, and the average over 
components can be approximated by the average over the Gaussian variable z, in analogy 
to the SK model in going from fl28|) to fT29|) . 

The distribution of the signal reconstruction components, conditioned on the true 
signal component s° in f lllOp reduces to 



and it reflects the Gaussian distribution of the zero-temperature cavity fields across 
components, fed through the soft-thresholding function which arises from the scalar Li 
minimization problem in f)113p . Here go an d Aq are determined through f 1 1 6 j) and f)107p . 
which can be thought of as a self-consistency condition within the cavity approximation 
demanding that the distribution across components of the cavity field is consistent 
with the distribution across components of the signal reconstruction, in analogy to the 
corresponding self-consistency condition fj29|) in the SK model. An example of the 
match between replica theory and simulations for the signal reconstruction distribution 
is shown in Fig. ED- 

6.3. From Message Passing to Network Dynamics 

The L\ minimization problem in fllOip can also be formulated as a message passing 
problem (551 1124] . and approximate formulations of the message passing dynamical 



P G (s k = s\s° k = 8°) = (( ri(s° + z^q- , Aq) )) g , 



(114) 
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system yield neural network-like dynamics which provide a fast iterative way to solve the 
Li problem. The graphical model consists of iV variable nodes, one for each component 
of the unknown signal Sj, and M degree N factor nodes, one for each measurement, 
plus N more degree 1 factor nodes to implement the Li norm. For example, the Gibbs 
distribution defined by fll02l> and (I103P decomposes as 

M N 

p( 8 ) = UM*)U*~ m > ( 115 ) 
^=1 1=1 

where the factor Vv( s ) is given by 

^(s) = e -&m**-*»") a ^ (116) 

and a M is row /i of the measurement matrix A. This decomposition is in a form 
suitable for the application of the message passing equations (132]) and (|33|) . However, a 
straightforward application of these equations is computationally complex. First, since 
each unknown component Sj is a real number, every message becomes a distribution 
over the real numbers, and one must keep track of MN such messages (the degree 1 
factors do not require associated messages and can be incorporated into the updates of 
the other messages). 

The first approximation made in (55] is to restrict the messages to be Gaussian, 
which is reasonable because the density of the random measurement matrix A implies 
that in each update, each message receives contributions from a large number of 
messages, allowing one to invoke the central limit theorem. Thus one need keep track 
of only 2 numbers for each message, leading to a dynamical system on 2MN variables. 
This system can be further simplified by noting [551 1124] that messages from the same 
variable i to different factors \i are all quite similar to each other; they differ only in 
excluding the effects of one factor /i out of M possible factors. Thus one might assume 
that Mj_j. M = Mj + 0(^=j). A similar argument holds for the factor to variable messages, 
suggesting = M Ai +0(^). By doing a careful expansion in one can then reduce 

the message passing equations to a dynamical system on M + N variables. 

A readable account of this reduction can be found in jl24j . Here we simply quote 
the main result. The dynamical system on M + N variables can be interpreted as an 
iterative update on two pairs of variables, a current estimate s* for the unknown N 
dimensional signal, and the resulting residual r* in M dimensional measurement space. 
The update equations are |124] . 

s* +1 = r?(s* + AV, 6) (117) 
r* =x- As* + for* -1 , (118) 

where 77 is the soft-thresholding function defined in ( 11131) . and here is applied component 
wise to its vector inputs. In |124] . it was shown that if these equations converge, then the 
resulting s°° is a global minimum of the Li energy function in f l 1 2 j) with i = 6(1 — b). 

The crucial term is the message passing derived term involving b in (11181) that 
endows the temporal evolution of the residual with a history dependence. Without this 
term, reconstruction performance is severely impaired. This dynamics can loosely be 
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interpreted as that of a two layer interacting neuronal network, where the residual r* 
is stored in the activity of M neurons in the first layer, while the current estimate of 
the sparse representation s* is stored in N neurons in a second layer. The first layer 
receives feedforward external input x and a top down inhibitory prediction of that input 
through synaptic connectivity A. The second layer neurons receive a feedforward drive 
from the residuals through a synaptic connectivity A T and have a nonlinear transfer 
function 7]. Interestingly, this network dynamics is different from other proposals 
for the implementation of L\ minimization and sparse coding [1H1 HZ]. Given the 
potential role of L\ minimization as a computational description of early visual [56] 
and olfactory [57] processing, it would be interesting to explore more fully the space 
of neuronal architectures and dynamics capable of implementing L\ minimization type 
computations. 

7. Discussion 

We have reviewed the applications of replicas, cavities, and message passing to basic 
models of network dynamics, memory storage, machine learning algorithms, and 
statistical models of data. While the ideas and models reviewed yield a rich and 
sometimes surprisingly striking picture, it is natural to ask how the above results are 
modified when assumptions in the models are relaxed, and what theoretical progress 
can be made through a statistical mechanics based analysis of these more complex 
scenarios. Here we will briefly discuss a few more prominent examples. However, we 
warn the reader that in this discussion we will barely be scratching the surface of a 
deep literature lying at the intersection of statistical mechanics, computer science and 
neuroscience. 

7.1. Network Dynamics 

First, in section [21 when we introduced the SK model and the Hopfield model, we 
were considering models of neuronal networks that had many simplifying assumptions, 
including: symmetric connectivity, binary neurons, and lack of external inputs. What 
happens for example, when the connectivity becomes asymmetric? Then there is 
no simple form for the stationary distribution of neuronal activity analogous to the 
equilibrium Gibbs distribution in fl2]). One must then posit a dynamical model for the 
network dynamics, and in many situations, one can use dynamic mean field theory 
methods |125[ 1126] to understand time averaged asymptotic statistical properties of 
neuronal activity. This was done in for example in [1271 11281 1129] for asymmetric 
networks. Interestingly, while fully asymmetric networks become ergodic, partially 
asymmetric networks retain fixed points, but the time it takes for transients to reach 
these fixed points can diverge exponentially with network size. 

Moreover dynamical versions of the SK and Hopfield models have a Lyapunov 
function that is bounded from below, implying that the long time asymptotic behavior 
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at zero temperature consists of fixed points only. In asymmetric networks, two more 
dynamical possibilities arise: oscillations and chaos. Seminal work showed analytically 
(via dynamic mean field theory) and through simulations that neuronal networks can 
exhibit deterministic high dimensional chaos [2TJ I130[ 1131] . Even when driven by a 
constant current input, such networks exhibit chaos by dynamically achieving a balance 
between excitatory and inhibitory inputs to individual neurons. This balance leads to 
spontaneous irregular neural activity characteristic of cortical states, in which neurons 
spike due to fluctuations in their input, as opposed to a mean superthreshold input 
current. Interestingly, when the inputs have more nontrivial temporal structure, such 
networks exhibit a sharp phase transition from a chaotic state to an ordered state, that 
is entrained by the input, as the input strength increases. This happens for example 
when the external input is either noisy |132] or oscillatory |133] . In the case of oscillatory 
input there is an interesting non-monotonic dependence in the input strength at which 
this phase transition occurs, as a function of oscillation frequency |133] . 

In section [2} we discussed binary models of neurons. However, biological neurons 
are characterized by analog internal states describing membrane voltage and ion channel 
conductance states, and their dynamics exhibits large spiking events in their membrane 
voltage. Dynamic mean field theory methods can be extended to spiking networks and 
were used to characterize the phase diagram of networks of excitatory and inhibitory 
leaky integrate and fire neurons |134j . For such neurons, whose internal state is 
characterized solely by a membrane voltage, one can derive an appropriate mean 
field theory by maintaining a distribution of membrane voltages across neurons in the 
network, and self-consistently solving for this distribution using Fokker-Planck methods 
(see |135l 1136] for reviews). This work |134] lead to four possible macroscopic phases 
of network dynamics, characterized by two possibilities for the temporal statistics of 
single neurons (regular periodic spike trains, or irregular aperiodic spike trains) times 
two possibilities for the population average firing rates (synchronous or temporally 
structured rates, or asynchronous or constant rates). Varying strengths of excitation, 
inhibition, and single neuron properties allow all four combinations to occur. 

More recently, in |1371 1138] the authors went beyond mean field theory to track 
entire microstate trajectories in spiking neural networks consisting of neurons in which 
it is possible to analytically compute the time of the first neuron to spike next, given 
the internal state of all neurons in the network. This allowed the authors to perform 
numerically exact computations of the entire spectrum of Lyapunov exponents by 
computing products of Jacobians associated with every future spike starting from an 
initial condition. They found classes of networks that exhibit extensive chaos, in which 
a finite fraction of all Lyapunov exponents were positive. Moreover, they showed that 
the Lyapunov spectrum is highly sensitive to the details of the action potential shape, 
as positive feedback effects associated with the rise of the action potential contribute 
most heavily to the divergence of microstate trajectories. Even more interestingly, the 
authors found "flux" tubes of stability surrounding trajectories: small perturbations to 
the network state decayed quickly, whereas larger perturbations lead to an exponential 
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divergence between trajectories. Thus each trajectory is surrounded by a stability 
tube. However, the radius of this tube shrinks with the number of neurons, N. 
This reveals that the calculation of Lyapunov exponents in spiking networks in the 
thermodynamic (N — > oo) limit is extremely subtle, due to the non- commutation of 
limits. Computing Lyapunov exponents requires taking a small perturbation limit, 
which if taken before the thermodynamic limit would yield negative exponents, but 
if taken after the thermodynamic limit, would yield positive exponents. In any 
case, injecting extra spikes into the network constitutes a large perturbation even at 
finite N, which leads to a divergence in trajectories. This picture is consistent with 
recent experimental results suggesting that the injection of extra spikes into a cortical 
network leads to a completely different spiking trajectory, without changing the overall 
population statistics of neural activity |139j . More generally, for reviews on network 
dynamics in neuroscience, see [140] 1141] . 

7.2. Learning and Generalization 

In the beginning of sectional we considered the capacity of simple network architectures 
to store, or memorize a set of input-output mappings. While memory is certainly 
important, the goal of most organisms is not simply to memorize past responses to past 
inputs, but rather to generalize from past experience in order to learn rules that can yield 
appropriate responses to novel inputs the organism has never seen before. This idea has 
been formalized in a statistical mechanics framework for the perceptron in [1421 1143] . 
Here the P training inputs and outputs are no longer random, but are generated from a 
"teacher" perceptron. The observable of interest then becomes the generalization error 
e g (a), which is by definition the probability that the trained perceptron disagrees with 
the teacher perceptron's correct answer on a novel input, not present in the training set. 
Here a = is the ratio of the number of training examples to the number of synapses N. 
For a wide variety of training procedures, or learning algorithms, statistical mechanics 
approaches found that e g (a) decays as O(-) for large a, indicating that the number of 
examples should be proportional to the number synapses in order for good generalization 
to occur. 

The perceptron, while acting in some sense as the Drosophila of statistical 
learning theory, is a very limited architecture in that it can only learn linearly 
separable classifications in which the two classes fall on opposite sides of a hyperplane. 
Statistical mechanics approaches have been used to analyze memory [144] 1145] 1146] 
and generalization [1471 1148] 1149] in more sophisticated multilayered networks. In a 
multilayered network, only the input and output layers are constrained to implement 
a desired mapping, while the internal, hidden layer activities remain unspecified. This 
feature generically leads to replica symmetry breaking, where the space of solutions 
to a desired input-output mapping breaks into multiple disconnected components, 
where each component corresponds to a different internal representation of hidden layer 
activities capable of implementing the desired mapping. Statistical mechanics has also 
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had success in the analysis of learning in other architectures and machine learning 
algorithms, including support vector machines |150[ I151[ 1152] , and Gaussian processes 
[153]. 

Another generalization of the perceptron is the tempotron, an architecture and 
learning rule capable of learning to classify spatiotemporal patterns of incoming spikes 
[154] . The tempotron can be trained to fire a spike for one class of input spike 
time patterns, and no spikes for another class, while the precise timing of the output 
spike can be left unspecified. A statistical mechanics analysis of a simplified binary 
tempotron was carried out in |155] . Interestingly, the space of solutions in synaptic 
weight space to any given spike time classification problem can be well described 
by a one-step replica symmetry broken phase shown schematically in Fig. [Tp. Each 
component corresponds to a different output spike time for the positive classifications, 
in direct analogy to the replica symmetry broken phase of multilayered networks in 
which each component corresponds to a different internal representation. The various 
solution components are both small (implying that very similar weights can yield very 
different classifications) and far apart (implying that very different weights can yield 
an identical classification). The authors verified that these properties persist even in 
a more biologically realistic Ho dgkin- Huxley model of a single neuron |155j . Overall 
this reveals a striking double dissociation between structure (synaptic connectivity) and 
function (implemented classification) even at the level of single neurons. This double 
dissociation has important implications for the interpretation of incoming connectomics 
data |156j . More generally, for reviews on applications of statistical mechanics to to 
memory, learning and generalization, see [121 H57] . 

7.3. Machine Learning and Data Analysis 

Starting in the latter part of section [3j we turned our attention to the statistical 
mechanics based analysis of machine learning algorithms designed to extract structured 
patterns from data, focusing on illusions of structure returned by such algorithms when 
applied to high dimensional noise. In real data analysis problems, we have to protect 
ourselves from such illusions, and so understanding these illusions present in pure noise 
is an important first step. However, we would ideally like to analyze the performance 
of machine learning algorithms when data contain both structured patterns as well 
as random noise. A key question in the design and analysis of experiments is then 
how much data do we need to reliably uncover structure buried within noise? Since, 
in the statistical mechanics based analysis of learning algorithms, the data plays the 
role of quenched disorder, we must analyze statistical mechanics problems in which the 
quenched disorder is no longer simply random, but itself has structure. 

This has been done for example in |158[ 196] for PCA applied to signals confined 
to a low dimensional linear space, but corrupted by high dimensional noise. The data 
is then drawn from a covariance matrix consisting of the identity plus a low rank part. 
A replica based computation of the typical eigenvalue spectrum of empirical covariance 
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matrices for data of this type revealed the presence of a series of phase transitions as 
the ratio between the amount of data and its ambient dimensionality increases. As 
this ratio increases, signal eigenvalues associated with the low rank part pop out of a 
Marcenko-Pasteur sea (i.e. Fig|6A) associated with the high dimensional noise. Thus 
this work reveals sharp thresholds in the amount of data required to resolve signal 
from noise. Also, interesting work has been done on the statistical mechanics based 
analysis of typical learning outcomes for other structured data settings, including finding 
a direction separating two Gaussian clouds |159[ 1160] , supervised learning from clustered 
input examples [161] , phase transitions in clustering as a function of cluster scale |162] , 
and learning Gaussian mixture models [1631 1164] . Moreover, statistical mechanics 
approaches to clustering have yielded interesting new algorithms and practical results, 
including superparamagnetic clustering |165[ 1166] . based on an isomorphism between 
cluster assignments and Potts model ground states, and a method for computing p- values 
for cluster significance [167] . using extreme value theory to compute a null distribution 
for the maximal number of data points over all regions in feature space of a given size. 

In section 15.31 we initiated a statistical mechanics based analysis of random 
dimensionality reduction by connecting the maximally incurred geometric distortion 
to correlated extreme value theory. For the simple case of point clouds, correlations 
could be neglected, while for hyperplanes, the correlations arose from fluctuations in 
the Coulomb gas interactions of eigenvalues of random matrices, and could be treated 
exactly. It would be interesting to study more complex manifolds. For example, rigorous 
upper bounds on the maximal distortion were proven in |104 j by surrounding arbitrary 
manifolds and their tangent planes by a scaffold of points, and then showing that if 
the geometry of this scaffold remains undistorted under any projection, then so does 
the geometry of the manifold. An application of the JL lemma to the scaffold then 
suffices to obtain an upperbound on the distortion incurred by the manifold under a 
random projection. To understand how tight or loose this upperbound is, it would useful 
to compute the typical distortion incurred by more complex manifold ensembles. For 
example, for manifolds consisting of unions of planes, one would be interested in the 
fluctuations of the maximal eigenvalue of multiple correlated matrices, corresponding 
to the restriction of the same random projection to each plane. Thus results from the 
eigenvalue spectra of random correlated matrices |168l 1169} 1170] could become relevant. 

Finally, we note that throughout most of this review we have focused on situations 
in which replica symmetry holds, though we have noted that several neuronal and 
machine learning problems, including multilayer networks, tempotrons, and clustering, 
are described by replica symmetry broken phases in which the solution space breaks up 
into many clusters, as well as suboptimal, higher energy metastable states. As noted 
at the end of section I2.4[ statistical mechanics based approaches have inspired a new 
algorithm, known as survey propagation (77J [78] that can find good solutions despite 
the proliferation of metastable states whose presence can confound simpler optimization 
and inference algorithms. Despite the power of survey propagation, its applications to 
neuroscience remain relatively unexplored. 
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In summary, decades of interactions between statistical physics, computer science, 
and neuroscience, have lead to beautiful insights, both into how neuronal dynamics leads 
to computation, as well as how our brains might create machine learning algorithms to 
analyze themselves. We suspect that further interactions between these fields are likely 
to provide exciting and insightful intellectual adventures for many years to come. 
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8. Appendix: Replica Theory 

8.1. Overall Framework 

Suppose we wish to do statistical mechanics on a set of N thermal degrees of freedom 
encoded in the N dimensional vector x, where the components are coupled to each 
other through some quenched disorder D, in a Hamiltonian H(x, D). In the above 
applications, x could be the spins s in a spin glass (then D is the connectivity matrix 
J), the synaptic weights w of a perceptron (then D is the set of examples to be stored), 
the variables u in the Stieltjes transform of an eigenvalue spectrum (then D is a random 
matrix) , or the residuals u in a compressed sensing problem (then D is the measurement 
matrix). As discussed above, to properly average over the quenched disorder D, we must 
average the replicated partition function 

<u n » D = ((/ m =1 dx a e-£:^ xa > D ))) D (H9) 

Conditioned on any particular realization of the quenched disorder D, the different 
replicated degrees of freedom x a are independent. However, integrating out the quenched 
disorder introduces interactions among the replicated variables. In all of the above 
applications, the resulting interactions depend only on the overlap matrix between 
replicas, defined as Q ab = ^x a • x fe . More precisely, the following identity holds, 

e-EL^^D)^ =e -NE(Q)^ (12Q) 

for some function E over the overlap matrix Q. Therefore it is useful to separate the 
remaining integral over x a in (1 1 1 9 [) into an integral over all possible overlaps Q ab , and 
then all possible x a configurations with a prescribed set of overlaps by introducing a 
5-function: 

«^ n »D = / U d Qa b e- NE(Q) J f[ d^Y[6[^-^ b -NQ ab ]. (121) 

ab a=l ab 

The integral over x a with a fixed set of overlaps Q ab can be carried out by introducing 
the exponential representation of the 5 function, 

5[x a • x b - NQ ab ] = J dQ ab e -$.6(*»^-iVQ ai ) j (122) 



( 
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where the integral over Q ab is understood to be along the imaginary axis. Inserting 
(j!22|) into (I12ip decouples the components of the vectors x a , yielding an integral over 
n scalar variables x a raised to the iV'th power. This final result can be written as 

((Z n )) B = J J] dQ ab dQ ab e- N ^^ Q ^ b+G ^\ (123) 



where 

G(Qab) = - In / II dx a e- H "&""* n \ (124) 

"* a 

is the partition function of an effective Hamiltonian 

H eS = J2x a Qa b x b . (125) 

ab 

Now in the large N limit, the final integrals over Q ab and Q ab can be done via the 
saddle point method, yielding a set of self-consistent equations for the saddle point by 
extremizing the exponent in (11231) : 

OE 

Qab = J^- (126) 

Qab = (x a x b ) n , (127) 

where (-) n denotes an average with respect to a Gibbs distribution with effective 
Hamiltonian H e g in HI 251) . 

In general both these equations must be solved in the n — > limit. Now in the case 
where x" are real valued- variables (as opposed to binary variables in the SK model), 
these equations can be further simplified because the integral over x a in (11241) can be 
done exactly, since it is Gaussian, and furthermore, the extremum over Q ab in (j!24p can 
be performed. Together, this yields an entropic factor (up to a multiplicative constant 
involving n and N), 

J ft dx a Jp[x a ■ x b - NQ ab ] = e NS ®\ (128) 



a=l ab 



where 



S(Q) = ^TrlogQ (129) 

represents (up to an additive constant) the entropy of replicated configurations x a with 
a prescribed overlap matrix Q. (11231) reduces to 

<u n » D = / n^«* e "^ w) " 5W))I . ( i3 °) 

ab 

and the saddle point overlap configuration Q represents a compromise between energy 
and entropy extremization in the exponent of (I130p . 



CONTENTS 



61 



8.2. Physical meaning of overlaps 

Here we make the connection between the replica overlap matrix Q ab and the disorder 
averaged distribution of overlaps, P(q), of two states x 1 and x 2 both drawn from a 
Gibbs distribution with Hamiltonian if(x, D). For a given realization of the disorder, 
the overlap distribution is 

PM = J dx 2 5 (q - Ix 1 ■ x 2 ) e -*C#*»-***» (131) 

where Z(D) = J dxe _I ' x ' D '. Averaging Pn(q) over the disorder is difficult because 
D appears both in the numerator and denominator of (11311) . To circumvent this, one 
can introduce replicas via the simple identity Z~ 2 = lim^o Z n ~ 2 . Using this, one can 
perform the easier average at integer n > 2, and then take the limit n — > at the end. 
Thus 

P(q) = ((P»(q))) u (132) 
= hm((/rX =1 ( fxVEL 1 ^,D) % _i x i. x ^ D (133) 

Here x 1 and x 2 are the original degrees of freedom with n — 2 additional replicas added to 
yield Z n ~ 2 . One can then average the right hand side of (11331) over D using a sequence 
of steps very similar to section 18.11 The final answer yields 

P{q) = lim — ^— £ 5(q - Q ab ) (134) 
n->o n {n - 1) ^ 

where Q ab is the saddle point replica overlap matrix. In situations where replica 
symmetry is broken, there will be multiple equivalent saddle points related to each 
other by the action of the permutation group on the replica indices a, b. The sum over 
these saddle points yields the sum in ( 11340 . In summary, the probability that two states 
have overlap q is, according to replica theory, equal to the fraction of off-diagonal matrix 
elements Q a b that take the value q. 

8.3. Replica symmetric equations 

Here we show how to take the n — > limit for various problems, in the replica symmetric 
approximation. We use Einstein summation convention in which repeated indices are 
meant to be summed over. 

8.3.1. SK Model We will now apply (JT23I) - (1X271) to the SK model from section El As 
we saw in (TTOj) . we have 

E{Q) = - (|) Ql b - (135) 
Then, ( TT261) gives 

Qab = - ^rrQab, (136) 




E(Q) - QabQab = ['-) Ql b - (137) 
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We make the following replica-symmetric ansatz for the saddle point: 

Qab = q+(l-q)S ab , (138) 
where we used the fact that ( TTT1) guarantees that Q aa = 1. We will determine q by 



minimizing the free energy. This leads to 

2 



E(Q) - Q ab Q 



ab 



(~[ 2\ , 2 2 

(1 — q )n + q n 



H, 



off 



2 



(139) 
(140) 



We can now evaluate (I124p using the identity (Q with a = 1: 



G{Q) 



(l-q)n+q(Y /a S a 

1 — q)n — In 



In 
2 

— (1 — q)n — In ^ 2 cosh(/3 v /g^) 



This gives us the free energy density: 



N //J 



1 A 

N dn 



((Z n ))j 



n=0 



ln[2cosh(/3^)])) . (141) 
As mentioned above, we determine q by minimizing this. We will need the identity 

<<*/(*)>>.= <</'(*)>>,. 
which can be derived with integration by parts. We find 
d 



dqW N Hi 



y(l-?)-^((^tanh(/3^) 



= y (l-g-((sech 2 (/3^) 

= ^(((tanh 2 (/3^))) z -g), 
therefore, the minimum satisfies ( IT41 . 



(142) 



8.3.2. Perceptron and Unsupervised Learning The starting point for the learning 
applications discussed here is the energy fl60l and entropy f)129p in the replicated 
partition function ( 11301) . These can be derived by following sections 13.31 and 18.11 Here 
we take the n — > limit. For the energy, we obtain 



hm£(Q) 

n— >0 



-a . 



W n 



d\ a 



o=l 



2tt Vdetg 



(143) 
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where 



-a In / f[ ^ ^ c iXaX ~ a ~* XaQ " hXh ~ T - 0V(Xa) 

a=l 

-alu / f[ dXa dXa e <A.X;-^i- g )E.(Xa) 3 -§(VgE tt <.) a -E.^j 

a=l 

-a In ((C)), (145) 
-n«((lnC)) z , (146) 



_ cfAcfA i X(A-v?z)-i(i- g )A 2 - ( 9^(A) 

e ~2^- q WW ( 147 ) 



'27r(l - g) 

is the partition function of a distribution whose interpretation will be given in section 

El 

In going from (I143P to (11441) we used the identity 



and inserted the replica symmetric ansatz Q a 6 = (1 — q)5 a h + q. Then the only coupling 
between the various A a 's in (I144p occurs through the term |(v/gEa -^a) 2 - We can thus 
decouple the A a variables at the expense of introducing a Gaussian integral via the 
identity e - ^ 2 = (( e lbz ^ , where z is a zero mean, unit variance Gaussian variable and 
(('))z denotes an average with respect to z. This transformation yields ( I145p . and, as 
n -> 0, ( 1147)1) . 

Now for the entropy, we obtain 

lim S(Q) = lim - Tr log Q (149) 

n— >0 n— >0 2 



+ ln(l - q) 



. -.- „ (150) 

.1 - g J 

Here we have used the fact that the replica symmetric Q has 1 eigenvalue equal to 
1 + (n — l)q and n — 1 eigenvalues equal to 1 — q. Finally, inserting (I146p and ( jl 501) 
into (11301) and performing the integration over g via a saddle point yields a saddle point 
equation for q corresponding to extremizing F(q) in ( |62l) . 



S.^. Distribution of Alignments 

Suppose we wish to compute the probability distribution across examples \x of the 
alignment of each example £ M with an optimal weight vector w derived from an 
unsupervised learning problem. Alternatively, one can think of this as the distribution 
of the data projected onto the optimal dimension. This distribution is 

^) = ^E*(A-A"), (151) 
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where A M = -jLw-^, an d w is drawn from the distribution (15^|) . For large iV and P we 
expect this distribution to be self-averaging, so for any fixed realization of the examples, 
it will be close to 



P(A) 



where 



(152) 



(153) 



and (( ■ )) denotes an average over the examples This average is hard to perform 
because the examples occur both in the numerator and denominator. This difficulty can 
be circumvented by introducing replicas via the simple identity -| = lim n _>.o l Z n ~ x . Thus 



P(A) = lim (7 / n"=i dw a 8(X - X\) e~^"=^ v(K) 



n->0 



(154) 



where A^ = ~^ w a-£, fl - Here the first replica plays the role of the numerator in ( I152p 
and replicas 2, . . . , n play the role of 4 in the n — > limit. Now we can introduce an 
integral representation of 5(X — X\), perform the Gaussian average over A^ and take the 
n —7- limit using a sequence of steps very similar to those in sections 18.11 and 18.3.21 
This yields 



P(A) 



1 (\-sfqzY 

e 2 



C y/2n(l-q) 

where ( is the partition function given by fl63|) and g extremizes the free energy (162]) 



(155) 



(§.5. Inverting the Stieltjies Transform 

It is helpful to think of (!70|) as a complex contour integral, with the contour running 
along the real axis. We can't simply set e = in ( ITT]) , as the pole at z' = z + ie would 
hit the contour. However, Cauchy's theorem tells us that we can deform the contour 
without changing the integral, provided that it doesn't cross any singularities. We will 
use the following contour, which takes a semicircle detour below the singularity: 



x. 



C{5): 



z + 5 e 



x. 



x e (— oo, —5], 
o e [-tt,o], 

x G [5, oo). 



It will help to take the limit 8 — > 0. 
We can write 



lim — Im Rw(z + ie) 

e->0+ TT 



lim — Im 

<5->0+ 7T 

lim — Im 

<5-S>0+ 7T 



dz 



, Pw{z 



C(5) 



dx 



z' — z 
Pw(x) 



X 



+ 



'd0 (i5e ld ) 



z 



Pw 



(z + 5< 



,i0 



5e 



+ 



x — z 
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The first and third terms diverge as S — > 0. However, their sum is finite. It is referred 
to as the Cauchy principal value of the integral. It also happens to be real, and we are 
only interested in the imaginary part. This leaves the second term: 

lim —ImRw(z + ie)= lim — Im / d#ipw (z + 5 e ie ) = pw(z). 

e->0+ 7T 5^0+ 7T J -tt V ' 
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